Package: YPInterimTesting
Type: Package
Title: Interim Monitoring Using Adaptively Weighted Log-Rank Test in
    Clinical Trials
Version: 0.1.0
Author: Daewoo Pak and Song Yang
Maintainer: Daewoo Pak <heavyrain.pak@gmail.com>
Description: Provide monitoring boundaries for interim testing using the
    adaptively weighted log-rank test developed by
    Yang and Prentice (2010 <doi:10.1111/j.1541-0420.2009.01243.x>).
    The package uses a re-sampling method to obtain stopping boundaries
    in sequential designs. The output consists of stopping boundaries at
    the interim looks along with nominal p-values defined as the
    probability of the test exceeding the specific observed value or
    critical value, regardless of the test behavior at other looks.
    The asymptotic distribution of the test statistics of the adaptively
    weighted log-rank test at the interim looks is examined in
    Yang (2017, pre-print).
License: GPL (>= 3)
Encoding: UTF-8
LazyData: true
RoxygenNote: 6.0.1
Imports: Rcpp
LinkingTo: Rcpp
NeedsCompilation: yes
Packaged: 2017-11-17 18:54:40 UTC; daewoo
Repository: CRAN
Date/Publication: 2017-11-17 19:05:19 UTC

Package: spotifyr
Title: Pull Track Audio Features from the 'Spotify' Web API
Version: 1.0.0
Authors@R: person("Charlie", "Thompson", email = "charles.thompson@barcelonagse.eu", role = c("aut", "cre"))
Description: A wrapper for pulling track audio features from the
    'Spotify' Web API <http://developer.spotify.com/web-api> in bulk.
    By automatically batching API requests, it allows you to enter an artist's
    name and retrieve their entire discography in seconds, along with audio
    features and track/album popularity metrics. You can also pull song and
    playlist information for a given 'Spotify' user (including yourself!).
Depends: R (>= 3.3.3)
Imports: dplyr, purrr, tidyr, httr, lubridate
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
RoxygenNote: 6.0.1
Suggests: testthat
URL: http://github.com/charlie86/spotifyr
BugReports: http://github.com/charlie86/spotifyr/issues
NeedsCompilation: no
Packaged: 2017-11-17 15:28:19 UTC; chuck
Author: Charlie Thompson [aut, cre]
Maintainer: Charlie Thompson <charles.thompson@barcelonagse.eu>
Repository: CRAN
Date/Publication: 2017-11-17 18:24:19 UTC

The Fleming-Harrington class for right-censored data was first introduced by Harrington and Fleming (1982). This class is widely used in survival analysis studies and it is a subset of the so-called weighted logrank test statistics. Recently, Oller and Gómez (2012) proposed an extension of this class for interval-censored data. This paper introduces the R package FHtest, which implements the Fleming-Harrington class for right-censored and interval-censored survival data. It provides an integrated approach for performing two-sample, k-sample and trend tests based on either counting process theory, likelihood theory, or permutation distributions. In this paper, we summarize the main aspects of the theoretical framework and present several examples with R code to illustrate the usage of the main functions of FHtest.

In spite of the interest in and appeal of convolution-based approaches for nonstationary spatial modeling, off-the-shelf software for model fitting does not yet exist. Convolution-based models are highly flexible yet notoriously difficult to fit, even with relatively small data sets. The general lack of pre-packaged options for model fitting makes it difficult to compare new methodology in nonstationary modeling with other existing methods, and as a result most new models are simply compared to stationary models. Using a convolution-based approach, we present a new nonstationary covariance function for spatial Gaussian process models that allows for efficient computing in two ways: first, by representing the spatially-varying parameters via a discrete mixture or "mixture component" model, and second, by estimating the mixture component parameters through a local likelihood approach. In order to make computations for a convolution-based nonstationary spatial model readily available, this paper also presents and describes the convoSPAT package for R. The nonstationary model is fit to both a synthetic data set and a real data application involving annual precipitation to demonstrate the capabilities of the package.

Latent class is a method for classifying subjects, originally based on binary outcome data but now extended to other data types. A major difficulty with the use of latent class models is the presence of heterogeneity of the outcome probabilities within the true classes, which violates the assumption of conditional independence, and will require a large number of classes to model the association in the data resulting in difficulties in interpretation. A solution is to include a normally distributed subject level random effect in the model so that the outcomes are now conditionally independent given both the class and random effect. A further extension is to incorporate an additional period level random effect when subjects are observed over time. The use of the randomLCA R package is demonstrated on three latent class examples: classification of subjects based on myocardial infarction symptoms, a diagnostic testing approach to comparing dentists in the diagnosis of dental caries and classification of infants based on respiratory and allergy symptoms over time.

The non-parametric maximum likelihood estimator and semi-parametric regression models are fundamental estimators for interval censored data, along with standard fully-parametric regression models. The R package icenReg is introduced which contains fast, reliable algorithms for fitting these models. In addition, the package contains functions for imputation of the censored response variables and diagnostics of both regression effects and baseline distribution.

R has excellent facilities for dealing with both dates and datetime objects.
For datetime objects, the POSIXt time type can be mapped to POSIXct and
its representation of fractional seconds since the January 1, 1970 “epoch” as
well as to the broken-out list representation in POSIXlt. Many add-on
packages use these facilities.

POSIXct uses a double type to provide 53 bits of resolution. That is generally
good enough for timestamps down to just above a microsecond, and has served
the R community rather well.
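
The resolution limit can be checked directly. The following is a small C++ sketch, not taken from any package: the helper name spacing_at is hypothetical, and it simply measures the gap between a double holding seconds since the epoch and the next representable double.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: distance from a double (seconds since the epoch)
// to the next representable double above it, i.e. the finest time step
// a POSIXct-style value can resolve at that point in time.
double spacing_at(double seconds_since_epoch) {
    const double next = std::nextafter(seconds_since_epoch,
                                       std::numeric_limits<double>::infinity());
    return next - seconds_since_epoch;
}
```

For a timestamp in late 2017 (about 1.51e9 seconds), spacing_at returns 2^-22 seconds, or roughly 0.24 microseconds. Anything much finer than that, such as a single nanosecond, cannot be told apart in a double-based timestamp.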

But increasingly, time increments are measured in nanoseconds. Other languages use a (signed)
64-bit integer to represent (integer) nanoseconds since the epoch. A bit over a year ago I realized
that we have this in R too—by combining the integer64 type in the very nice
bit64 package by Jens Oehlschlaegel with the
CCTZ-based parser and formatter in my own
RcppCCTZ package. And thus the
nanotime package was created.

Since then, Leonardo Silvestri joined in and significantly enhanced
nanotime by redoing it as an S4 class.

A simple example:

[1] "1970-01-01T00:00:00.000000042+00:00"

Here we used a single element with value 42, and created a nanotime vector from it, which is
taken to mean 42 nanoseconds since the epoch, or basically almost at January 1, 1970. See the
nanotime page and package for more.
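
To make the integer-nanoseconds convention concrete, here is a minimal C++ sketch that turns a signed 64-bit count of nanoseconds since the epoch into a UTC string in the format shown above. The function name format_nanos_utc is hypothetical and is not part of nanotime or RcppCCTZ; it uses only the C standard library rather than CCTZ.

```cpp
#include <cstdint>
#include <cstdio>
#include <ctime>
#include <string>

// Hypothetical helper (not the nanotime/RcppCCTZ formatter): render
// nanoseconds since the epoch as an ISO-8601 UTC string with nine
// fractional digits.
std::string format_nanos_utc(std::int64_t nanos) {
    const std::int64_t billion = 1000000000LL;
    std::int64_t secs = nanos / billion;   // whole seconds
    std::int64_t frac = nanos % billion;   // leftover nanoseconds
    if (frac < 0) {                        // floor division for pre-epoch values
        frac += billion;
        secs -= 1;
    }
    const std::time_t t = static_cast<std::time_t>(secs);
    const std::tm* utc = std::gmtime(&t);  // not thread-safe; fine for a sketch
    if (utc == nullptr) return std::string();
    char date[32];
    std::strftime(date, sizeof date, "%Y-%m-%dT%H:%M:%S", utc);
    char out[64];
    std::snprintf(out, sizeof out, "%s.%09lld+00:00", date,
                  static_cast<long long>(frac));
    return std::string(out);
}
```

Calling format_nanos_utc(42) yields the same string as the R output above. Values before the epoch depend on the platform's gmtime accepting a negative time_t.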

Step 1: Large Integer Types

So more recently I had a need to efficiently generate such (many such) integer vectors from
int64_t data. Both Leonardo and Dan
helped with initial discussions and tests. One can either use a reinterpret_cast<>, or a straight
memcpy as the key trick in bit64 is to (re-)use the
underlying int64_t representation (which we do not have in R) via the 64-bit double
representative. Just never access it as a double. So we have the space, we just need to ensure we
copy the bits (i.e. actual binary content) rather than their value (when “mapped” to a type).
This leads to the following function to create an integer64 vector for use in R at the C++ level:

This uses the standard trick of setting a class attribute to set an S3 class. Now the values in
v will return to R (exactly how is treated below), and R will treat the vector as an integer64
object (provided the bit64 package has been loaded).

As mentioned, reinterpret_cast<>() can be used too, but leads to a compiler warning (under
g++-6). Per Matt’s excellent compiler explorer, both
approaches lead to the same mov semantics, so we prefer the variant that does not yell at us.
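
The bit-copying idea from this step can be sketched in plain C++ without any R headers. This is only an illustration under stated assumptions: std::vector<double> stands in for an R numeric vector, the names pack_int64 and unpack_int64 are hypothetical, and the class attribute that bit64 relies on is not modelled.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

static_assert(sizeof(double) == sizeof(std::int64_t),
              "the trick needs a 64-bit double");

// Hypothetical helper: copy the raw bits of each int64_t into a
// double-sized slot. The doubles are storage only and must never be
// interpreted as floating-point values.
std::vector<double> pack_int64(const std::vector<std::int64_t>& values) {
    std::vector<double> bits(values.size());
    if (!values.empty()) {
        std::memcpy(bits.data(), values.data(),
                    values.size() * sizeof(std::int64_t));
    }
    return bits;
}

// Inverse operation: recover the integers from the double-typed storage.
std::vector<std::int64_t> unpack_int64(const std::vector<double>& bits) {
    std::vector<std::int64_t> values(bits.size());
    if (!bits.empty()) {
        std::memcpy(values.data(), bits.data(),
                    bits.size() * sizeof(double));
    }
    return values;
}
```

A round trip through pack_int64 and unpack_int64 returns the original integers unchanged, whereas a value-converting cast such as static_cast<double> would lose precision for large magnitudes.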

Step 2: Nanotime

A nanotime vector is created using an internal integer64 vector. So the previous function
almost gets us there. But we need to set the S4 type correctly. So that needed some extra
work—and the following function seems to do it right:

This creates a nanotime vector as a proper S4 object. As before, we set some class attributes
(though in a nested fashion as S4 is that fancy) and also invoke one R macro.

Step 3: Returning them to R via data.table

The astute reader will have noticed that neither one of the functions presented so far had an
Rcpp::export tag. This is because of their function argument: int64_t is not representable
natively by R, which is why we need a workaround.

Matt Dowle has been very helpful in providing excellent support for
nanotime in data.table (even after we, ahem,
borked it by switching from S3 to S4). This support was of course relatively straightforward
because data.table already had support for the
underlying integer64, and we had the additional formatters etc.

Example

The following example shows the output from the preceding function:

The tenth (!!) annual R/Finance conference will take place in Chicago on the UIC campus on June 1 and 2, 2018. Please see the call for papers below (or at the website) and consider submitting a paper.

We are once again very excited about our conference, thrilled about who we hope may agree to be our anniversary keynotes, and hope that many R / Finance users will not only join us in Chicago in June but also submit an exciting proposal.

So read on below, and see you in Chicago in June!

Call for Papers

R/Finance 2018: Applied Finance with R
June 1 and 2, 2018
University of Illinois at Chicago, IL, USA

The tenth annual R/Finance conference for applied finance using R will be held June 1 and 2, 2018 in Chicago, IL, USA at the University of Illinois at Chicago. The conference will cover topics including portfolio management, time series analysis, advanced risk tools, high-performance computing, market microstructure, and econometrics. All will be discussed within the context of using R as a primary tool for financial risk management, portfolio construction, and trading.

Over the past nine years, R/Finance has included attendees from around the world. It has featured presentations from prominent academics and practitioners, and we anticipate another exciting line-up for 2018.

We invite you to submit complete papers in pdf format for consideration. We will also consider one-page abstracts (in txt or pdf format) although more complete papers are preferred. We welcome submissions for both full talks and abbreviated "lightning talks." Both academic and practitioner proposals related to R are encouraged.

All slides will be made publicly available at conference time. Presenters are strongly encouraged to provide working R code to accompany the slides. Data sets should also be made public for the purposes of reproducibility (though we realize this may be limited due to contracts with data vendors). Preference may be given to presenters who have released R packages.

Please submit proposals online at http://go.uic.edu/rfinsubmit. Submissions will be reviewed and accepted on a rolling basis with a final submission deadline of February 2, 2018. Submitters will be notified via email by March 2, 2018 of acceptance, presentation length, and financial assistance (if requested).

Financial assistance for travel and accommodation may be available to presenters. Requests for financial assistance do not affect acceptance decisions. Requests should be made at the time of submission. Requests made after submission are much less likely to be fulfilled. Assistance will be granted at the discretion of the conference committee.

Additional details will be announced via the conference website at http://www.RinFinance.com/ as they become available. Information on previous years' presenters and their presentations is also at the conference website. We will make a separate announcement when registration opens.

For the program committee:

Gib Bassett, Peter Carl, Dirk Eddelbuettel, Brian Peterson,
Dale Rosenthal, Jeffrey Ryan, Joshua Ulrich

A shiny new (mostly-but-not-completely maintenance) release of RQuantLib, now at version 0.4.4, arrived on CRAN overnight, and will get to Debian shortly. This is the first release in over a year, and it contains (mostly) a small number of fixes throughout. It also includes the update to the new DateVector and DatetimeVector classes which become the default with the upcoming Rcpp 0.12.14 release (just like this week's RcppQuantuccia release). One piece of new code is due to François Cocquemas who added support for discrete dividends to both European and American options. See below for the complete set of changes reported in the NEWS file.

As with release 0.4.3 a little over a year ago, we will not have new Windows binaries from CRAN as I apparently have insufficient powers of persuasion to get CRAN to update their QuantLib libraries. So we need a volunteer. If someone could please build a binary package for Windows from the 0.4.4 sources, I would be happy to once again host it on the GHRR drat repo. Please contact me directly if you can help.

Changes are listed below:

Changes in RQuantLib version 0.4.4 (2017-11-07)

Changes in RQuantLib code:

Equity options can now be analyzed via discrete dividends through two vectors of dividend dates and values (Francois Cocquemas in #73 fixing #72)

Some package and dependency information was updated in files DESCRIPTION and NAMESPACE.

The new Date(time)Vector classes introduced with Rcpp 0.12.8 are now used when available.

Minor corrections were applied to BKTree, to vanilla options for the case of intraday time stamps, to the SabrSwaption documentation, and to bond utilities for the most recent QuantLib release.

If you have been following this blog, you may have noticed that I haven't posted any update for more than a year now. The reason is that I've been busy with my research and my work, and I promised not to share anything here until I finished my degree (Master of Science in Statistics). Anyway, at this point I think it's time to share with you what I've learned in the past year. So far, it's been a good year for Statistics, especially in the Philippines. In fact, on November 15, 2016, the team of local data scientists made a huge step in Big data by organizing the first ever conference on this topic. Also, months before that, the 13th National Convention on Statistics, organized by the Philippine Statistics Authority, invited a keynote speaker from Paris21 to tackle Big data and its use in the government.

So without further ado, in this post, I would like to share a new programming language which I've used for several months now, and it's called Julia. This programming language is by far my favorite, it's a well-thought-out language as many would say, for many reasons. The first of course is the speed, second is the grammar, and many more. I can't list them down here, but I suggest you visit the official website, and try it for yourself.

Installation

The installation of this program is straightforward: simply go to Julia's official download page and download the binaries for your operating system. Alternatively, you can install Julia by downloading JuliaPro from the Julia Computing products. This will set up everything you need, which includes the Github Atom Editor out of the box. After installation, the first time you load the command-line-version program, you'll have the following window:

Working with the command-line-version is actually fun, and personally I think Julia has the best command-line-version compared to R and Python in terms of features. For example, you can shift to shell mode by simply pressing ; in the Julia prompt, and use ? to activate the help mode. It also has autocompletion by pressing Tab after entering the first few letters of the syntax. The LaTeX UTF autocompletion is also one of the best features, and almost any symbol/character can be used as a variable, like the emoticon shown below:

To install the Jupyter notebook, simply run the following code:

In the screenshot above, I tweaked the theme of the notebook using the script from this repository. As mentioned, to set up Julia in Github Atom Editor, I recommend downloading JuliaPro, or you can follow the instructions on the Juno Lab website. After installation, you can add Atom Extensions like Minimap, which is not available by default, and in case you are interested, the syntax highlighter I used in the screenshot is the Gruvbox Plus.

Further, to set up Julia in Microsoft Visual Studio Code, open the program, press Ctrl+P, paste ext install language-julia and hit Enter. This will install the Julia extension for Visual Studio Code. After installation, you can load the Julia REPL by pressing Ctrl+Shift+P (Windows) or Cmd+Shift+P (Mac), entering julia start repl, and pressing Enter. If there is an error, the path may need to be specified properly. To do this, go to Preferences > Settings. Then in the .json user file settings, enter the following:

Of course, you need to check the path properly by replacing the Julia-0.6.0-rc3 (Windows) or Julia-0.6.app (Mac) with the desired version of your Julia, and the C:/Users/MyName with your desired path. Further, I use the following setting in my .json file to adjust my Minimap similar to the screenshot above.

Lastly, to toggle the cursor's focus between the script pane and the integrated Julia terminal using Ctrl+`, I use the following Keybindings (go to Preferences > Keyboard Shortcuts > keybindings.json).

For more on this topic visit the official github page.

The three editors above have advantages and disadvantages. However, my primary editor is Visual Studio Code, because it is fast and loaded with features as well. The major limitation of this editor is the LaTeX UTF autocompletion, which is available for Atom Editor. There are third party packages like Unicode LaTeX that can do the job indirectly, or alternatively you can generate the LaTeX UTF using the console (the integrated Julia terminal in Visual Studio Code). I think this is not a big deal, and maybe in the near future this capability will be added. On the other hand, the Atom Editor has of course more features for Julia, for example the plot pane, the workspace, and many more. The only problem is that it's kind of slow, especially when working with several datasets in your workspace, plus plots, plus very long lines of code; scrolling through it is not smooth.

Nevertheless, let's be positive and hope that more improvements are coming to these editors. Finally, for those who want to start using Julia, visit the Official Documentation and Learning Materials; ask questions on Julia Discourse and join the Julia Gitter.

A first maintenance release of RcppQuantuccia got to CRAN earlier today.

RcppQuantuccia brings the Quantuccia header-only subset / variant of QuantLib to R. At present it mostly offers calendaring, but Quantuccia just got a decent amount of new functions so hopefully we can offer more here too.

This release was motivated by the upcoming Rcpp release which will deprecate the old Date and Datetime vectors in favour of newer ones. So this release of RcppQuantuccia switches to the newer ones.

Other changes are below:

Changes in version 0.0.2 (2017-11-06)

Added calendars for Canada, China, Germany, Japan and United Kingdom.

A maintenance release of our pinp package for snazzier one or two column vignettes is now on CRAN as of yesterday.

In version 0.0.3, we disabled the default \pnasbreak command we inherit from the PNAS LaTeX style. That change turns out to have been too drastic. So we reverted it, yet added a new YAML front-matter option skip_final_break which, if set to TRUE, will skip this break. With a default value of FALSE we maintain prior behaviour.

A screenshot of the package vignette can be seen below. Additional screenshots are at the pinp page.

Adaptive enrichment designs involve preplanned rules for modifying patient enrollment criteria based on data accrued in an ongoing trial. These designs may be useful when it is suspected that a subpopulation, e.g., defined by a biomarker or risk score measured at baseline, may benefit more from treatment than the complementary subpopulation. We compare two types of such designs, for the case of two subpopulations that partition the overall population. The first type starts by enrolling the subpopulation where it is suspected the new treatment is most likely to work, and then may expand inclusion criteria if there is early evidence of a treatment benefit. The second type starts by enrolling from the overall population and then may selectively restrict enrollment if sufficient evidence accrues that the treatment is not benefiting a subpopulation. We construct two-stage designs of each type that guarantee strong control of the familywise Type I error rate, asymptotically. We then compare performance of the designs from each type under different scenarios; the scenarios mimic key features of a completed non-inferiority trial of HIV treatments. Performance criteria include power, sample size, Type I error, estimator bias, and confidence interval coverage probability.

We propose a class of adaptive randomized trial designs for comparing two treatments to a common control in two disjoint subpopulations. The type of adaptation, called adaptive enrichment, involves a preplanned rule for modifying enrollment and arm assignment based on accruing data in an ongoing trial. The motivation for this adaptive feature is that interim data may indicate that a subpopulation, such as those with lower disease severity at baseline, is unlikely to benefit from a particular treatment, while uncertainty remains for the other treatment and/or subpopulation. We developed a new multiple testing procedure tailored to this design problem. The procedure improves power by: leveraging the correlation between the test statistics arising from the two treatments being compared to a common control; reallocating alpha across subpopulations; and using the data only through minimally sufficient statistics. We optimize expected sample size over this class of designs, focusing on designs with 2 stages. Our approach is demonstrated in simulation studies that mimic features of a completed trial of a medical device for treating heart failure. User-friendly, open-source software that implements the trial design optimization is provided.

In a regression setting, it is often of interest to quantify the importance of various features in predicting the response. Commonly, the variable importance measure used is determined by the regression technique employed. For this reason, practitioners often only resort to one of a few regression techniques for which a variable importance measure is naturally defined. Unfortunately, these regression techniques are often sub-optimal for predicting response. Additionally, because the variable importance measures native to different regression techniques generally have a different interpretation, comparisons across techniques can be difficult. In this work, we study a novel variable importance measure that can be used with any regression technique, and whose interpretation is agnostic to the technique used. Specifically, we propose a generalization of the ANOVA variable importance measure, and discuss how it facilitates the use of possibly-complex machine learning techniques to flexibly estimate the variable importance of a single feature or group of features. Using the tools of targeted learning, we also describe how to construct an efficient estimator of this measure, as well as a valid confidence interval. Through simulations, we show that our proposal has good practical operating characteristics, and we illustrate its use with data from a study of the median house price in the Boston area, and a study of risk factors for cardiovascular disease in South Africa.

Sitting on top of R’s external pointers, the Rcpp::XPtr class provides
a powerful and generic framework for
passing user-supplied C++ functions
to a C++ backend. This technique is exploited in the
RcppDE package, an
efficient C++ based implementation of the
DEoptim package that
accepts optimisation objectives as both R and compiled functions (see
demo("compiled", "RcppDE") for further details). This solution has a
couple of issues though:

Some repetitive scaffolding is always needed in order to bring the XPtr to R space.

There is no way of checking whether a user-provided C++ function
complies with the internal signature supported by the C++ backend,
which may lead to weird runtime errors.

Better XPtr handling with RcppXPtrUtils

In a nutshell, RcppXPtrUtils provides functions for dealing with these
two issues: namely, cppXPtr and checkXPtr. As a package author,
you only need to 1) import and re-export cppXPtr to compile code and
transparently retrieve an XPtr, and 2) use checkXPtr to internally
check function signatures.

cppXPtr works in the same way as Rcpp::cppFunction, but instead of
returning a wrapper to directly call the compiled function from R, it
returns an XPtr to be passed to, unwrapped and called from C++. The
returned object is an R externalptr wrapped into a class called
XPtr along with additional information about the function signature.

[1] "XPtr"

'double foo(int a, double b)' <pointer: 0x55c64060de40>

The checkXPtr function checks the object against a given
signature. If the verification fails, it throws an informative error:

Error in checkXPtr(ptr, "int", c("int", "double")): Bad XPtr signature:
Wrong return type 'double', should be 'int'.

Error in checkXPtr(ptr, "int", c("int")): Bad XPtr signature:
Wrong return type 'double', should be 'int'.
Wrong number of arguments, should be 1'.

Error in checkXPtr(ptr, "int", c("double", "std::string")): Bad XPtr signature:
Wrong return type 'double', should be 'int'.
Wrong argument type 'int', should be 'double'.
Wrong argument type 'double', should be 'std::string'.

Complete use case

First, let us define a templated C++ backend that performs some
processing with a user-supplied function and a couple of adapters:
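
As a rough illustration in plain C++, with a raw function pointer standing in for the XPtr and every name (core_processing, user_fn, scale_by) hypothetical, such a backend could be sketched as follows. The loop count and the accumulation are assumptions chosen to keep the example deterministic.

```cpp
// Hypothetical signature for user-supplied functions: the first
// argument comes from the backend, the second from the user.
using user_fn = double (*)(int n, double l);

// Templated backend: repeatedly calls the supplied function and
// accumulates its results. F can be any callable with a compatible
// signature, including a plain function pointer.
template <typename F>
double core_processing(F func, double l, int iterations) {
    double accum = 0.0;
    for (int i = 0; i < iterations; ++i) {
        accum += func(i, l);  // i is backend-provided, l is user-provided
    }
    return accum;
}

// Example user-supplied function (an assumption for illustration).
double scale_by(int n, double l) {
    return n * l;
}
```

With these definitions, core_processing(&scale_by, 1.5, 4) sums 0, 1.5, 3 and 4.5 and returns 9. In the Rcpp setting, the adapters would unwrap the XPtr (or wrap an R function) before handing the callable to the same templated core.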

Note that the user-supplied function takes two arguments: one is also
user-provided and the other is provided by the backend itself. This
core is exposed through the following R function:

Finally, we can compare the XPtr approach with a pure R-based one,
and with a compiled function wrapped in R, as returned by
Rcpp::cppFunction:

Unit: microseconds
                     expr       min        lq       mean     median        uq       max neval cld
     execute(func_r, 1.5) 13812.742 15287.713 16429.4728 16017.6470 16818.716 53182.418   100   c
 execute(func_r_cpp, 1.5) 12150.643 13347.326 14482.0998 14145.5830 15078.917 22634.887   100   b
   execute(func_cpp, 1.5)   288.156   369.646   440.1885   400.6895   445.511  1525.653   100   a

When biomarker studies involve patients at multiple centers and the goal is to develop biomarker combinations for diagnosis, prognosis, or screening, we consider evaluating the predictive capacity of a given combination with the center-adjusted AUC (aAUC), a summary of conditional performance. Rather than using a general method to construct the biomarker combination, such as logistic regression, we propose estimating the combination by directly maximizing the aAUC. Furthermore, it may be desirable to have a biomarker combination with similar predictive capacity across centers. To that end, we allow for penalization of the variability in center-specific performance. We demonstrate good asymptotic properties of the resulting combinations. Simulations provide small-sample evidence that maximizing the aAUC can lead to combinations with greater predictive capacity than combinations constructed via logistic regression. We further illustrate the utility of constructing combinations by maximizing the aAUC while penalizing variability. We apply these methods to data from a study of acute kidney injury after cardiac surgery.

C++ templates and function overloading are incompatible with R’s C API, so
polymorphism must be achieved via run-time dispatch, handled explicitly by
the programmer.

The traditional technique for operating on SEXP objects in a generic
manner entails a great deal of boilerplate code, which can be unsightly,
unmaintainable, and error-prone.

The desire to provide polymorphic functions which operate on vectors
and matrices is common enough that Rcpp provides the utility macros
RCPP_RETURN_VECTOR and RCPP_RETURN_MATRIX to simplify the process.

Subsequently, these macros were extended to handle an (essentially)
arbitrary number of arguments, provided that a C++11 compiler is used.

Background

To motivate a discussion of polymorphic functions, imagine that we desire a
function (ends) which, given an input vector x and an integer n, returns
a vector containing the first and last n elements of x concatenated.
Furthermore, we require ends to be a single interface which is capable of
handling multiple types of input vectors (integers, floating point values,
strings, etc.), rather than having a separate function for each case. How can
this be achieved?

R Implementation

A naïve implementation in R might look something like this:

[1] 1 2 3 4 6 7 8 9

[1] "a" "b" "c" "x" "y" "z"

[1] -0.560476 -0.230177 0.701356 -0.472791

The simple function above demonstrates a key feature of many dynamically-typed
programming languages, one which has undoubtedly been a significant factor in their
rise to popularity: the ability to write generic code with little-to-no
additional effort on the part of the developer. Without getting into a discussion
of the pros and cons of static vs. dynamic typing, it is evident that being able
to dispatch a single function generically on multiple object types, as opposed to,
e.g. having to manage separate implementations of ends for each vector type,
helps us to write more concise, expressive code. Being an article about Rcpp,
however, the story does not end here, and we consider how this problem might
be approached in C++, which has a much more strict type system than R.

C++ Implementation(s)

For simplicity, we begin by considering solutions in the context of a “pure”
(i.e., not called from R) C++ program. Eschewing more complicated tactics
involving run-time dispatch (virtual functions, etc.), the C++ language
provides us with two straightforward methods of achieving this at compile time:

Although the above program meets our criteria, the code duplication is profound.
Being seasoned C++ programmers, we recognize this
as a textbook use case for templates and refactor accordingly:

This approach is much more maintainable as we have a single implementation
of ends rather than one implementation per typedef. With this in hand, we
now look to make our C++ version of ends callable from R via Rcpp.

Rcpp Implementation (First Attempt)

Many people, myself included, have attempted some variation of the following at
one point or another:

Sadly this does not work: magical as Rcpp attributes may be, there are limits
to what they can do, and at least for the time being, translating C++ template
functions into something compatible with R’s C API is out of the question. Similarly,
the first C++ approach from earlier is also not viable, as the C programming
language does not support function overloading. In fact, C does not
support any flavor of type-safe static polymorphism, meaning that our generic
function must be implemented through run-time polymorphism, as touched on in
Kevin Ushey’s Gallery article Dynamic Wrapping and Recursion with Rcpp.

Rcpp Implementation (Second Attempt)

Armed with the almighty TYPEOF macro and a SEXPTYPE cheatsheet, we
modify the template code like so:

[1] 1 2 3 4 6 7 8 9

[1] "a" "b" "c" "x" "y" "z"

[1] -1.067824 -0.217975 -0.305963 -0.380471

Warning in ends(list()): Invalid SEXPTYPE 19 (VECSXP).

NULL

Some key remarks:

Following the ubiquitous Rcpp idiom, we have converted our ends template to use
an integer parameter instead of a type parameter. This is a crucial point, and
later on, we will exploit it to our benefit.

The template implementation is wrapped in a namespace in order to avoid a
naming conflict; this is a personal preference but not strictly necessary.
Alternatively, we could get rid of the namespace and rename either the template
function or the exported function (or both).

We use the opaque type SEXP for our input / output vector since we need a
single input / output type. In this particular situation, replacing SEXP with
the Rcpp type RObject would also be suitable as it is a generic class capable
of representing any SEXP type.

Since we have used an opaque type for our input vector, we must cast it
to the appropriate Rcpp::Vector type accordingly within each case label. (For
further reference, the list of vector aliases can be found here).

Finally, we could dress each return value in Rcpp::wrap to convert
the Rcpp::Vector to a SEXP, but it isn’t necessary because Rcpp attributes
will do this automatically (if possible).

At this point we have a polymorphic function, written in C++, and callable from
R. But that switch statement sure is an eyesore, and it will need to be
implemented every time we wish to export a generic function to R. Aesthetics
aside, a more pressing concern is that boilerplate such as this increases the
likelihood of introducing bugs into our codebase – and since we are leveraging
run-time dispatch, these bugs will not be caught by the compiler. For example,
there is nothing to prevent this from compiling:

In our particular case, such mistakes likely would not be too disastrous, but
it should not be difficult to see how situations like this can put you (or a
user of your library!) on the fast track to segfault.
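
The dispatch pattern discussed in this section can be mimicked without R or Rcpp. The sketch below is a plain C++ analogue, not Rcpp code: the Tag enum, the storage trait, and the Erased struct are all hypothetical stand-ins (for SEXPTYPE values, the integer-parameterised vector class, and SEXP respectively). It shows a function templated on an integer being selected by a run-time switch, and why a mislabelled case goes unnoticed by the compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical type tags playing the role of SEXPTYPE values.
enum Tag { INT_TAG = 0, REAL_TAG = 1, STR_TAG = 2 };

// Maps an integer tag to an element type (a compile-time lookup table).
template <int TAG> struct storage;
template <> struct storage<INT_TAG>  { typedef int type; };
template <> struct storage<REAL_TAG> { typedef double type; };
template <> struct storage<STR_TAG>  { typedef std::string type; };

// Type-erased input: a tag plus an untyped pointer (a crude stand-in for SEXP).
struct Erased {
    Tag tag;
    const void* data;
};

namespace impl {
// Templated on an integer rather than on a type.
template <int TAG>
std::size_t len(const Erased& x) {
    typedef std::vector<typename storage<TAG>::type> vec_t;
    return static_cast<const vec_t*>(x.data)->size();
}
}  // namespace impl

// Run-time dispatch onto the compile-time template. Swapping two labels
// (e.g. calling impl::len<STR_TAG> under case INT_TAG) would still compile:
// the compiler cannot check that the tag matches the data behind the pointer.
std::size_t len(const Erased& x) {
    switch (x.tag) {
        case INT_TAG:  return impl::len<INT_TAG>(x);
        case REAL_TAG: return impl::len<REAL_TAG>(x);
        case STR_TAG:  return impl::len<STR_TAG>(x);
    }
    throw std::invalid_argument("unsupported tag");
}
```

Every exported generic function written this way needs its own copy of the switch, which is the boilerplate the rest of the article sets out to remove.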

Obligatory Remark on Macro Safety

The C preprocessor is undeniably one of the more controversial aspects of the
C++ programming language, as its utility as a metaprogramming tool is rivaled
only by its potential for abuse. A proper discussion of the various pitfalls
associated with C-style macros is well beyond the scope of this article, so
the reader is encouraged to explore this topic on their own. On the bright side,
the particular macros that we will be discussing are sufficiently complex
and limited in scope that misuse is much more likely to result in a compiler
error than a silent bug, so practically speaking, one can expect a fair bit of
return for relatively little risk.

Synopsis

At a high level, we summarize the RCPP_RETURN macros as follows:

There are two separate macros for dealing with vectors and matrices,
RCPP_RETURN_VECTOR and RCPP_RETURN_MATRIX, respectively.

In either case, code is generated for the following SEXPTYPEs:

INTSXP (integers)

REALSXP (numerics)

RAWSXP (raw bits)

LGLSXP (logicals)

CPLXSXP (complex numbers)

STRSXP (characters / strings)

VECSXP (lists)

EXPRSXP (expressions)

In C++98 mode, each macro accepts two arguments:

A template function

A SEXP object

In C++11 mode (or higher), each macro additionally accepts zero or more
arguments which are forwarded to the template function.

Finally, the template function must meet the following criteria:

It is templated on a single, integer parameter.

In the C++98 case, it accepts a single SEXP (or something convertible to
SEXP) argument.

In the C++11 case, it may accept more than one argument, but the first
argument is subject to the previous constraint.

Examining our templated impl::ends function from the previous section, we see
that it meets the first requirement, but fails the second, due to its second
parameter n. Before exploring how ends might be adapted to meet the (C++98)
template requirements, it will be helpful to demonstrate correct usage with a few
simple examples.
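
To make the synopsis above more concrete, the following sketch shows how a macro of this general shape can generate the dispatching switch and forward extra arguments. DISPATCH_RETURN, Value, and first_tag are hypothetical names invented for this illustration; this is not Rcpp's actual macro definition, it handles only two tags, and it requires C++11 for the variadic parts.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical tagged value standing in for a SEXP.
struct Value {
    int tag;    // 0 = integer payload in `i`, 1 = real payload in `d`
    long i;
    double d;
};

// Helper: extracts the tag of the first argument and ignores the rest.
template <typename... Args>
int first_tag(const Value& x, const Args&...) { return x.tag; }

// Simplified analogue of a RCPP_RETURN-style macro. It expands to a switch
// that *returns* from the enclosing function, so the caller writes no
// `return` keyword. Note that the arguments are textually repeated in the
// expansion, so they should be free of side effects.
#define DISPATCH_RETURN(FUN, ...)                                  \
    switch (first_tag(__VA_ARGS__)) {                              \
        case 0:  return FUN<0>(__VA_ARGS__);                       \
        case 1:  return FUN<1>(__VA_ARGS__);                       \
        default: throw std::invalid_argument("unsupported tag");   \
    }

namespace impl {
// Templated on a single integer parameter; the first argument is the
// tagged value, the second is an extra forwarded argument.
template <int TAG> double scaled(const Value& x, double k);
template <> double scaled<0>(const Value& x, double k) {
    return k * static_cast<double>(x.i);
}
template <> double scaled<1>(const Value& x, double k) {
    return k * x.d;
}
}  // namespace impl

// The public, non-template entry point.
double scaled(const Value& x, double k) {
    DISPATCH_RETURN(impl::scaled, x, k);
}
```

The design mirrors the constraints listed above: the function handed to the macro is templated on one integer, its first argument carries the type tag, and any further arguments are simply passed along.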

Fixed Return Type

We consider two situations where our input type is generic, but our output
type is fixed:

Determining the length (number of elements) of
a vector, in which an int is always returned.

Determining the dimensions (number of rows and number of columns)
of a matrix, in which a length-two IntegerVector is always returned.

First, our len function:

(Note that we omit the return keyword, as it is part of the macro definition.)
Testing this out on the various supported vector types:

[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Similarly, creating a generic function that determines the dimensions of an
input matrix is trivial:

And checking this against base::dim,

[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

everything seems to be in order.

It’s worth pointing out that, for various reasons, it is possible to pass a
matrix object to an Rcpp function which calls RCPP_RETURN_VECTOR:

[1] 9

[1] 9

Although this is sensible in the case of len – and even saves us from
implementing a matrix-specific version – there may be situations where
this behavior is undesirable. To distinguish between the two object types we
can rely on the API function Rf_isMatrix:

[1] 9

<Rcpp::exception in len2(matrix(1:9, 3)): matrix objects not supported.>

We don’t have to worry about the opposite scenario, as this is already handled
within Rcpp library code:

<Rcpp::not_a_matrix in dims(1:5): Not a matrix.>

Generic Return Type

In many cases our return type will correspond to our input type. For example,
exposing the Rcpp sugar function rev is trivial:

As a slightly more complex example, suppose we would like to write a function
to sort matrices which preserves the dimensions of the input, since
base::sort falls short of the latter stipulation:

[1] 1 2 3 4 5 6 7 8 9

There are two obstacles we need to overcome:

The Matrix class does not implement its own sort method. However,
since Matrix inherits from Vector,
we can sort the matrix as a Vector and construct the result from this
sorted data with the appropriate dimensions.

As noted previously, the RCPP_RETURN macros will generate code to handle
exactly 8 SEXPTYPEs; no less, no more. Some functions, like Vector::sort,
are not implemented for all eight of these types, so in order to avoid a
compilation error, we need to add template specializations.

With this in mind, we have the following implementation of msort:

Note that elements will be sorted in column-major order since we filled our
result using this constructor. We can verify that msort works as intended by checking a few test cases:

[,1] [,2] [,3]
[1,] 1 7 4
[2,] 3 9 6
[3,] 5 2 8

[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9

[1] 1 2 3 4 5 6 7 8 9

[,1] [,2]
[1,] "a" "y"
[2,] "c" "b"
[3,] "z" "x"

[,1] [,2]
[1,] "a" "x"
[2,] "b" "y"
[3,] "c" "z"

[1] "a" "b" "c" "x" "y" "z"

List of 9
$ : int 1
$ : int 2
$ : int 3
$ : int 4
$ : int 5
$ : int 6
$ : int 7
$ : int 8
$ : int 9
- attr(*, "dim")= int [1:2] 3 3

<Rcpp::exception in msort(x): sort not allowed for lists.>

<simpleError in sort.int(x, na.last = na.last, decreasing = decreasing, ...): 'x' must be atomic>

Revisiting the ‘ends’ Function

Having familiarized ourselves with basic usage of the RCPP_RETURN macros, we
can return to the problem of implementing our ends function with
RCPP_RETURN_VECTOR. Just to recap the situation, the template function
passed to the macro must meet the following two criteria in C++98 mode:

It is templated on a single, integer parameter (representing the
Vector type).

It accepts a single SEXP (or convertible to SEXP) argument.

Currently ends has the signature

meaning that the first criterion is met, but the second is not. In order
to preserve the functionality provided by the int parameter, we effectively
need to generate a new template function which has access to the user-provided
value at run-time, but without passing it as a function parameter.

The technique we are looking for is called partial function application, and it can be implemented
using one of my favorite C++ tools: the functor. Contrary to typical functor
usage, however, our implementation features a slight twist: rather than
using a template class with a non-template function call operator, as is the
case with std::greater, etc., we are
going to make operator() a template itself:

Not bad, right? All in all, the changes are fairly minor:

The function body of Ends::operator() is identical to that of
impl::ends.

n is now a private data member rather than a function parameter, which
gets initialized in the constructor.

Instead of passing a free-standing template function to RCPP_RETURN_VECTOR,
we pass the expression Ends(n), where n is supplied at run-time from the
R session. In turn, the macro will invoke Ends::operator() on the SEXP
(RObject, in our case), using the specified n value.
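
The same idea can be tried outside of Rcpp. The sketch below is a plain C++ analogue under stated assumptions: call_with_one_arg is a hypothetical caller that, like the C++98 restriction described above, can pass exactly one argument, and operator() is templated on the element type rather than on an integer. The Ends class here is modelled on the description in this section, not copied from the article's code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical caller that only knows how to invoke f with one argument.
template <typename F, typename T>
std::vector<T> call_with_one_arg(F f, const std::vector<T>& x) {
    return f(x);
}

// Partial application via a functor: n is bound at construction time,
// and the call operator itself is a template.
class Ends {
    std::size_t n;

public:
    explicit Ends(std::size_t n_) : n(n_) {}

    template <typename T>
    std::vector<T> operator()(const std::vector<T>& v) const {
        typedef typename std::vector<T>::difference_type diff_t;
        const diff_t k = static_cast<diff_t>(std::min(n, v.size()));
        std::vector<T> res(v.begin(), v.begin() + k);   // first k elements
        res.insert(res.end(), v.end() - k, v.end());    // last k elements
        return res;
    }
};
```

Here call_with_one_arg(Ends(4), x) behaves like a two-argument call ends(x, 4), even though the caller never sees n.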

We can demonstrate this on various test cases:

[1] 1 2 3 4 6 7 8 9

[1] "a" "b" "c" "x" "y" "z"

[1] -0.694707 -0.207917 0.123854 0.215942

A Modern Alternative

As alluded to earlier, a more modern compiler (supporting C++11 or later)
will free us from the “single SEXP argument” restriction, which means
that we no longer have to move additional parameters into a function
object. Here is ends re-implemented using the C++11 version of
RCPP_RETURN_VECTOR (note the // [[Rcpp::plugins(cpp11)]]
attribute declaration):

[1] 1 2 3 4 6 7 8 9

[1] "a" "b" "c" "x" "y" "z"

[1] 0.379639 -0.502323 0.181303 -0.138891

The current definition of RCPP_RETURN_VECTOR and RCPP_RETURN_MATRIX allows for up
to 24 arguments to be passed; although in principle, the true upper bound
depends on your compiler’s implementation of the __VA_ARGS__ macro, which
is likely greater than 24. Having said this, if you find yourself trying
to pass around more than 3 or 4 parameters at once, it’s probably time
to do some refactoring.

A common problem in statistics is the minimization of an objective function. In contrast to linear models, there is no analytical solution for models that are nonlinear in the parameters, such as logistic regression, neural networks, and nonlinear regression models (like the Michaelis-Menten model). In this situation, we have to use mathematical programming or optimization. One popular optimization algorithm is gradient descent, which we're going to illustrate here. To start with, let's consider a simple function with a closed-form solution given by \begin{equation} f(\beta) \triangleq \beta^4 - 3\beta^3 + 2. \end{equation} We want to minimize this function with respect to $\beta$. The quick solution, as calculus taught us, is to compute the first derivative of the function, that is \begin{equation} \frac{\text{d}f(\beta)}{\text{d}\beta}=4\beta^3-9\beta^2. \end{equation} Setting this to 0 to obtain the stationary point (and discarding the root $\hat{\beta}=0$, which is an inflection point rather than a minimum) gives us \begin{align} \frac{\text{d}f(\beta)}{\text{d}\beta}&\overset{\text{set}}{=}0\nonumber\\ 4\hat{\beta}^3-9\hat{\beta}^2&=0\nonumber\\ 4\hat{\beta}^3&=9\hat{\beta}^2\nonumber\\ 4\hat{\beta}&=9\nonumber\\ \hat{\beta}&=\frac{9}{4}. \end{align}

The following plot shows the minimum of the function at $\hat{\beta}=\frac{9}{4}$ (red line in the plot below).

Now let's consider minimizing this function using gradient descent with the following algorithm:

Initialize $\mathbf{x}_{r},r=0$

while $\lVert \mathbf{x}_{r}-\mathbf{x}_{r+1}\rVert > \nu$

$\mathbf{x}_{r+1}\leftarrow\mathbf{x}_{r}-\gamma\nabla f(\mathbf{x}_r)$

$r\leftarrow r+1$

where $\nabla f(\mathbf{x}_r)$ is the gradient of the cost function, $\gamma$ is the learning-rate parameter of the algorithm, and $\nu$ is the precision parameter. For the function above, let the initial guess be $\hat{\beta}_0=4$ and $\gamma=.001$ with $\nu=.00001$. Then $\nabla f(\hat{\beta}_0)=112$, so that \[\hat{\beta}_1=\hat{\beta}_0-.001(112)=3.888.\] And $|\hat{\beta}_1 - \hat{\beta}_0| = 0.112> \nu$. Repeat the process until at some $r$, $|\hat{\beta}_{r}-\hat{\beta}_{r+1}| \ngtr \nu$. It will turn out that 350 iterations are needed to satisfy the desired inequality, the plot of which is in the following figure with estimated minimum $\hat{\beta}_{350}=2.250483\approx\frac{9}{4}$.
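
The iteration just described can be sketched in a few lines. The version below is written in C++ rather than the R or Python of the original scripts, and the names grad, Result, and gradient_descent, as well as the max_iter safeguard, are assumptions of this sketch. It stops once the last step is no larger than $\nu$.

```cpp
#include <cassert>
#include <cmath>

// Derivative of f(beta) = beta^4 - 3 beta^3 + 2.
double grad(double beta) {
    return 4.0 * beta * beta * beta - 9.0 * beta * beta;
}

struct Result {
    double beta;      // final estimate
    int iterations;   // number of update steps taken
};

// Plain gradient descent: beta_{r+1} = beta_r - gamma * f'(beta_r),
// repeated until |beta_r - beta_{r+1}| <= nu or max_iter steps were taken.
Result gradient_descent(double beta0, double gamma, double nu, int max_iter) {
    double beta_old = beta0;
    double beta_new = beta_old - gamma * grad(beta_old);
    int iter = 1;
    while (std::fabs(beta_new - beta_old) > nu && iter < max_iter) {
        beta_old = beta_new;
        beta_new = beta_old - gamma * grad(beta_old);
        ++iter;
    }
    return Result{beta_new, iter};
}
```

Calling gradient_descent(4.0, 0.001, 0.00001, 100000) reproduces the setting of the text; since grad(4.0) is 112, the first step moves $\hat{\beta}$ from 4 to 3.888, and the returned estimate lands close to $\frac{9}{4}$.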

Obviously the convergence is slow, and we can adjust this by tuning the learning-rate parameter. For example, if we increase it to $\gamma=.01$ (change gamma to .01 in the code above), the algorithm converges at the 42nd iteration. To support that claim, see the steps of its gradient in the plot below.

If we change the starting value from 4 to .1 (change beta_new to .1) with $\gamma=.01$, the algorithm converges at the 173rd iteration with estimate $\hat{\beta}_{173}=2.249962\approx\frac{9}{4}$ (see the plot below).

Now let's consider another function known as Rosenbrock defined as \begin{equation} f(\mathbf{w})\triangleq(1 - w_1) ^ 2 + 100 (w_2 - w_1^2)^2. \end{equation} The gradient is \begin{align} \nabla f(\mathbf{w})&=[-2(1 - w_1) - 400(w_2 - w_1^2) w_1]\mathbf{i}+200(w_2-w_1^2)\mathbf{j}\nonumber\\ &=\left[\begin{array}{c} -2(1 - w_1) - 400(w_2 - w_1^2) w_1\\ 200(w_2-w_1^2) \end{array}\right]. \end{align} Let the initial guess be $\hat{\mathbf{w}}_0=\left[\begin{array}{c}-1.8\\-.8\end{array}\right]$, $\gamma=.0002$, and $\nu=.00001$. Then $\nabla f(\hat{\mathbf{w}}_0)=\left[\begin{array}{c} -2914.4\\-808.0\end{array}\right]$. So that \begin{equation}\nonumber \hat{\mathbf{w}}_1=\hat{\mathbf{w}}_0-\gamma\nabla f(\hat{\mathbf{w}}_0)=\left[\begin{array}{c} -1.21712 \\-0.63840\end{array}\right]. \end{equation} And $\lVert\hat{\mathbf{w}}_0-\hat{\mathbf{w}}_1\rVert=0.6048666>\nu$. Repeat the process until at some $r$, $\lVert\hat{\mathbf{w}}_r-\hat{\mathbf{w}}_{r+1}\rVert\ngtr \nu$. It will turn out that 23,374 iterations are needed for the desired inequality with estimate $\hat{\mathbf{w}}_{23375}=\left[\begin{array}{c} 0.9464841 \\0.8956111\end{array}\right]$, the contour plot is depicted in the figure below.
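
The two-dimensional case follows the same recipe. The sketch below, again in C++ and with hypothetical names (Vec2, rosenbrock_grad, step, dist, gradient_descent), implements the gradient given above and a single update step, so the first iteration of the worked example can be checked by hand; the max_iter cap is an added safeguard, not part of the original algorithm.

```cpp
#include <array>
#include <cassert>
#include <cmath>

typedef std::array<double, 2> Vec2;

// Gradient of f(w) = (1 - w1)^2 + 100 (w2 - w1^2)^2.
Vec2 rosenbrock_grad(const Vec2& w) {
    const double r = w[1] - w[0] * w[0];
    return Vec2{{ -2.0 * (1.0 - w[0]) - 400.0 * r * w[0], 200.0 * r }};
}

// One update: w_{r+1} = w_r - gamma * grad f(w_r).
Vec2 step(const Vec2& w, double gamma) {
    const Vec2 g = rosenbrock_grad(w);
    return Vec2{{ w[0] - gamma * g[0], w[1] - gamma * g[1] }};
}

// Euclidean distance between two points.
double dist(const Vec2& a, const Vec2& b) {
    return std::hypot(a[0] - b[0], a[1] - b[1]);
}

// Iterate until the step length is no larger than nu (or max_iter is hit).
Vec2 gradient_descent(Vec2 w, double gamma, double nu, int max_iter) {
    for (int i = 0; i < max_iter; ++i) {
        const Vec2 next = step(w, gamma);
        const bool done = dist(w, next) <= nu;
        w = next;
        if (done) break;
    }
    return w;
}
```

Starting from $\hat{\mathbf{w}}_0=[-1.8,-.8]^T$ with $\gamma=.0002$, step returns approximately $[-1.21712,-0.63840]^T$, matching the first iterate computed in the text.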

Notice that I did not use ggplot for the contour plot. This is because the plot needs to be updated 23,374 times just to accommodate the arrows for the trajectory of the gradient vectors, and ggplot is just slow. Finally, we can also visualize the gradient points on the surface as shown in the following figure.

In a future blog post, I hope to apply this algorithm to statistical models like linear/nonlinear regression models for simple illustration.

mlpack is, to quote, a scalable machine learning library, written in C++,
that aims to provide fast, extensible implementations of cutting-edge machine learning
algorithms. It has been written by Ryan Curtin and others, and is
described in two papers in BigLearning (2011) and
JMLR (2013). mlpack uses
Armadillo as the underlying linear algebra library, which, thanks to
RcppArmadillo, is already a rather
well-known library in the R ecosystem.

RcppMLPACK1

Qiang Kou has created the
RcppMLPACK package on CRAN for easy-to-use
integration of mlpack with R. It integrates the
mlpack sources, and is, as a CRAN package, widely available on all
platforms.

However, this RcppMLPACK package is also based on a
by-now dated version of mlpack. Quoting again: mlpack provides these
algorithms as simple command-line programs and C++ classes which can then be integrated into
larger-scale machine learning solutions. Version 2 of the mlpack sources
switched to a slightly more encompassing build also requiring the Boost
libraries ‘program_options’, ‘unit_test_framework’ and ‘serialization’. Within the context of an R
package, we could condition out the first two as R provides both the direct interface (hence no need
to parse command-line options) and also the testing framework. However, it would be both difficult
and potentially undesirable to condition out the serialization which allows
mlpack to store and resume machine learning tasks.

This package works fine on Linux provided mlpack,
Armadillo and Boost are installed.

OS X / macOS

For macOS / OS X, James Balamuta has tried to set up a homebrew
recipe but there are some tricky interactions with the compiler suites used by both brew and R on
macOS.

Windows

For Windows, one could do what Jeroen Ooms has done and build
(external) libraries. Volunteers are encouraged to get in touch via the issue tickets at GitHub.

Installation from source

Releases are available from a drat repository hosted
in the GitHub organization RcppMLPACK. So

will use this. If you prefer to rather pick a random commit state,

will work as well.

Example: Logistic Regression

To illustrate mlpack we show a first simple example also included in the
package. As with the rest of the Rcpp Gallery, these are “live” code examples.

We can then call this function with the same (trivial) data set as used in the first unit test for
it:

$parameters
[1] 67.9550 -13.6328 -13.6328

Example: Naive Bayes Classifier

A second example shows the NaiveBayesClassifier class.

We can use the sample data included in a recent-enough version of the RcppMLPACK package:

The evaluation of peritoneal dialysis (PD) programmes requires the use of statistical methods that suit the complexity of such programmes. Multi-state regression models taking competing risks into account are a good example of suitable approaches. In this work, multi-state structured additive regression (STAR) models combined with penalized splines (P-splines) are proposed to evaluate peritoneal dialysis programmes. These models are very flexible since they may consider smooth estimates of baseline transition intensities and the inclusion of time-varying and smooth covariate effects at each transition. A key issue in survival analysis is the quantification of the time-dependent predictive accuracy of a given regression model, which is typically assessed using receiver operating characteristic (ROC)-based methodologies. The main objective of the present study is to adapt the concept of time-dependent ROC curves, and their corresponding area under the curve (AUC), to a multi-state competing risks framework. All statistical methodologies discussed in this work were applied to PD survival data. Using a multi-state competing risks framework, this study explored the effects of major clinical covariates on survival such as age, sex, diabetes and previous renal replacement therapy. The multi-state model was composed of one transient state (peritonitis) and several absorbing states (death, transfer to haemodialysis and renal transplantation). The application of STAR models combined with time-dependent ROC curves revealed important conclusions not previously reported in the nephrology literature when using standard statistical methodologies. For practical application, all the statistical methods proposed in this article were implemented in R and we wrote and made available a script named NestedCompRisks.

Index measures are commonly used in medical research and clinical practice, primarily for quantification of health risks in individual subjects or patients. The utility of an index measure is ultimately contingent on its ability to predict health outcomes. Construction of medical indices has largely been based on heuristic arguments, although the acceptance of a new index typically requires objective validation, preferably with multiple outcomes. In this article, we propose an analytical tool for index development and validation. We use a multivariate single-index model to ascertain the best functional form for risk index construction. Methodologically, the proposed model represents a multivariate extension of the traditional single-index models. Such an extension is important because it assures that the resultant index simultaneously works for multiple outcomes. The model is developed in the general framework of longitudinal data analysis. We use penalized cubic splines to characterize the index components while leaving the other subject characteristics as additive components. The splines are estimated directly by penalizing nonlinear least squares, and we show that the model can be implemented using existing software. To illustrate, we examine the formation of an adiposity index for prediction of systolic and diastolic blood pressure in children. We assess the performance of the method through a simulation study.

The shared frailty model is a popular tool to analyze correlated right-censored time-to-event data. In the shared frailty model, the latent frailty is assumed to be shared by the members of a cluster and is assigned a parametric distribution, typically a gamma distribution due to its conjugacy. In the case of interval-censored time-to-event data, the inclusion of frailties results in complicated intractable likelihoods. Here, we propose a flexible frailty model for analyzing such data by assuming a smooth semi-parametric form for the conditional time-to-event distribution and a parametric or a flexible form for the frailty distribution. The results of a simulation study suggest that the estimation of regression parameters is robust to misspecification of the frailty distribution (even when the frailty distribution is multimodal or skewed). Given sufficiently large sample sizes and number of clusters, the flexible approach produces smooth and accurate posterior estimates for the baseline survival function and for the frailty density, and it can correctly detect and identify unusual frailty density forms. The methodology is illustrated using dental data from the Signal Tandmobiel® study.

To represent the complex structure of intensive longitudinal data of multiple individuals, we propose a hierarchical Bayesian Dynamic Model (BDM). This BDM is a generalized linear hierarchical model where the individual parameters do not necessarily follow a normal distribution. The model parameters can be estimated on the basis of relatively small sample sizes and in the presence of missing time points. We present the BDM and discuss the model identification, convergence and selection. The use of the BDM is illustrated using data from a randomized clinical trial to study the differential effects of three treatments for panic disorder. The data involves the number of panic attacks experienced weekly (73 individuals, 10–52 time points) during treatment. Presuming that the counts are Poisson distributed, the BDM considered involves a linear trend model with an exponential link function. The final model included a moving average parameter and an external variable (duration of symptoms pre-treatment). Our results show that cognitive behavioural therapy is less effective in reducing panic attacks than serotonin selective re-uptake inhibitors or a combination of both. Post hoc analyses revealed that males show a slightly higher number of panic attacks at the onset of treatment than females.

Most R users will know that data frames are lists. You can easily verify that a data frame is a list by typing

d <- data.frame(id=1:2, name=c("Jon", "Mark"))
d

id name
1 1 Jon
2 2 Mark

is.list(d)

[1] TRUE

However, data frames are lists with some special properties. For example, all entries in the list must have the same length (here 2), etc. You can find a nice description of the differences between lists and data frames here. Accessing the first column of d, we find that it contains a vector (and a factor in the case of column name). Note that [[ ]] is an operator to select a list element. As data frames are lists, it works here as well.

is.vector(d[[1]])

[1] TRUE

Data frame columns can contain lists

For a long time, I was unaware of the fact that data frames may also contain lists as columns instead of vectors. For example, let’s assume Jon’s children are Mary and James, and Mark’s children are called Greta and Sally. Their names are stored in a list with two elements. We can add them to the data frame like this:

d$children <- list(c("Mary", "James"), c("Greta", "Sally"))
d

id name children
1 1 Jon Mary, James
2 2 Mark Greta, Sally

A single data frame entry in column children now contains more than one value. Given that the column is a list, not a vector, we cannot proceed as usual when modifying an entry of the column. For example, to change Jon’s children, we cannot do

> d[1 , "children"] <- c("Mary", "James", "Thomas")
Error in `[<-.data.frame`(`*tmp*`, 1, "children", value = c("Mary", "James", :
replacement has 3 rows, data has 1

Taking into account the list structure of the column, we can type the following to change the values in a single cell.

d[1 , "children"][[1]] <- list(c("Mary", "James", "Thomas"))
# or also
d$children[1] <- list(c("Mary", "James", "Thomas"))
d

id name children
1 1 Jon Mary, James, Thomas
2 2 Mark Greta, Sally

You can also create a data frame having a list as a column using the data.frame function, but with a little tweak. The list column has to be wrapped inside the function I. This will protect it from several conversions taking place in data.frame (see the ?I documentation).

d <- data.frame(id = 1:2,
                name = c("Jon", "Mark"),
                children = I(list(c("Mary", "James"),
                                  c("Greta", "Sally")))
)

This is an interesting feature, which gives me a deeper understanding of what a data frame is. But when exactly would I want to use it? I have not encountered the need to use it very often yet (though of course there may be plenty of situations where it makes sense). But today I had a case where this feature seemed particularly useful.

Converting lists and data frames to JSON

I had two separate types of information: one stored in a data frame and the other in a list. Referring to the example above, I had

Working with the superb jsonlite package to convert R to JSON, I could do the following to get the result above.

library(jsonlite)
l <- split(d, seq(nrow(d))) # convert data frame rows to list
l <- unname(l) # remove list names
for (i in seq_along(l)) # add element from ch to list
l[[i]] <- c(l[[i]], children=ch[i])
toJSON(l, pretty=T, auto_unbox = T) # convert to JSON

The results are correct, but getting there involved quite a number of tedious steps. These can be avoided by directly placing the list into a column of the data frame. Then jsonlite::toJSON takes care of the rest.

Nice :) What we do here is basically create the same nested list structure as above, only now it is disguised as a data frame. However, this approach is much more convenient.

Ever wonder what mathematics lies behind face recognition on gadgets like digital cameras and smartphones? For the most part, it has something to do with statistics. One statistical tool capable of this is Principal Component Analysis (PCA). In this post, however, we will not do face recognition (sorry to disappoint you), as we reserve this for a future post while I'm still doing research on it. Instead, we go through its basic concept and use it for data reduction on the spectral bands of an image using R.

Let's view it mathematically

Consider a line $L$ in a parametric form described as a set of all vectors $k\cdot\mathbf{u}+\mathbf{v}$ parameterized by $k\in \mathbb{R}$, where $\mathbf{v}$ is a vector orthogonal to a normalized vector $\mathbf{u}$. Below is the graphical equivalent of the statement: So if given a point $\mathbf{x}=[x_1,x_2]^T$, the orthogonal projection of this point on the line $L$ is given by $(\mathbf{u}^T\mathbf{x})\mathbf{u}+\mathbf{v}$. Graphically, we mean

$Proj$ is the projection of the point $\mathbf{x}$ on the line, where its position is defined by the scalar $\mathbf{u}^{T}\mathbf{x}$. Therefore, if we let $\mathbf{X}=[X_1, X_2]^T$ be a random vector, then the random variable $Y=\mathbf{u}^T\mathbf{X}$ describes the variability of the data in the direction of the normalized vector $\mathbf{u}$, so that $Y$ is a linear combination of $X_i, i=1,2$. Principal component analysis identifies linear combinations of the original variables $\mathbf{X}$ that contain most of the information, in the sense of variability, contained in the data. The general assumption is that useful information is proportional to the variability. PCA is used for data dimensionality reduction and for interpretation of data. (Ref 1. Bajorski, 2012)

To better understand this, consider two dimensional data set, below is the plot of it along with two lines ($L_1$ and $L_2$) that are orthogonal to each other: If we project the points orthogonally to both lines we have,

So if the normalized vector $\mathbf{u}_1$ defines the direction of $L_1$, then the variability of the points on $L_1$ is described by the random variable $Y_1=\mathbf{u}_1^T\mathbf{X}$. Also, if $\mathbf{u}_2$ is a normalized vector that defines the direction of $L_2$, then the variability of the points on this line is described by the random variable $Y_2=\mathbf{u}_2^T\mathbf{X}$. The first principal component is the one with maximum variability. So in this case, we can see that $Y_2$ is more variable than $Y_1$, since the points projected on $L_2$ are more dispersed than on $L_1$. In practice, however, the variances of the linear combinations $Y_i = \mathbf{u}_i^T\mathbf{X}, i=1,2,\cdots,p$ are maximized sequentially so that $Y_1$ is the linear combination of the first principal component, $Y_2$ is the linear combination of the second principal component, and so on. Further, the estimate of the direction vector $\mathbf{u}$ is simply the normalized eigenvector $\mathbf{e}$ of the variance-covariance matrix $\mathbf{\Sigma}$ of the original variable $\mathbf{X}$. And the variability explained by the principal component is the corresponding eigenvalue $\lambda$. For more details on the theory of PCA refer to (Bajorski, 2012) at Reference 1 below.
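
The link between the direction vector, the eigenvector, and the eigenvalue can be checked numerically in the two-dimensional case. The following C++ sketch (not part of the original R analysis; Component, first_component, and projected_variance are names made up for this illustration) uses the closed-form eigen decomposition of a symmetric $2\times 2$ matrix $\mathbf{\Sigma}=\left[\begin{array}{cc}a&b\\b&c\end{array}\right]$ and the identity $Var(\mathbf{u}^T\mathbf{X})=\mathbf{u}^T\mathbf{\Sigma}\mathbf{u}$.

```cpp
#include <cassert>
#include <cmath>

struct Component {
    double lambda;   // largest eigenvalue of Sigma
    double u1, u2;   // corresponding normalized eigenvector
};

// Leading eigenpair of the symmetric matrix Sigma = [[a, b], [b, c]].
Component first_component(double a, double b, double c) {
    const double mean = 0.5 * (a + c);
    const double half_diff = 0.5 * (a - c);
    const double lambda = mean + std::sqrt(half_diff * half_diff + b * b);

    double u1, u2;
    if (b != 0.0) {
        // (a - lambda) u1 + b u2 = 0 is solved by u = (b, lambda - a).
        u1 = b;
        u2 = lambda - a;
    } else if (a >= c) {
        u1 = 1.0; u2 = 0.0;   // diagonal Sigma: first axis dominates
    } else {
        u1 = 0.0; u2 = 1.0;   // diagonal Sigma: second axis dominates
    }
    const double norm = std::hypot(u1, u2);
    return Component{lambda, u1 / norm, u2 / norm};
}

// Variance of Y = u^T X when X has covariance Sigma: u^T Sigma u.
double projected_variance(double a, double b, double c, double u1, double u2) {
    return a * u1 * u1 + 2.0 * b * u1 * u2 + c * u2 * u2;
}
```

For example, with $a=c=2$ and $b=1$ the leading eigenvalue is 3, the direction is $[1,1]^T/\sqrt{2}$, and the projected variance along that direction equals the eigenvalue.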

As promised, we will do dimensionality reduction using PCA. We will use the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) data from (Bajorski, 2012); you can use other locations of AVIRIS data that can be downloaded here. However, since in most cases AVIRIS data contains hundreds of bands, for simplicity we will stick with the data given in (Bajorski, 2012), which was cleaned and reduced to 152 bands only.

What are spectral bands?

In imaging, spectral bands refer to the third dimension of the image usually denoted as $\lambda$. For example, RGB image contains red, green and blue bands as shown below along with the first two dimensions $x$ and $y$ that define the resolution of the image.

These are a few of the bands that are visible to our eyes; there are other bands that are not visible to us, like infrared, and many others in the electromagnetic spectrum. That is why in most cases AVIRIS data contains a huge number of bands, each capturing different characteristics of the image. Below is the proper description of the data.

Data

The Airborne Visible/Infrared Imaging Spectrometer (AVIRIS), is a sensor collecting spectral radiance in the range of wavelengths from 400 to 2500 nm. It has been flown on various aircraft platforms, and many images of the Earth’s surface are available. A 100 by 100 pixel AVIRIS image of an urban area in Rochester, NY, near the Lake Ontario shoreline is shown below. The scene has a wide range of natural and man-made material including a mixture of commercial/warehouse and residential neighborhoods, which adds a wide range of spectral diversity. Prior to processing, invalid bands (due to atmospheric water absorption) were removed, reducing the overall dimensionality to 152 bands. This image has been used in Bajorski et al. (2004) and Bajorski (2011a, 2011b). The first 152 values in the AVIRIS Data represent the spectral radiance values (a spectral curve) for the top left pixel. This is followed by spectral curves of the pixels in the first row, followed by the next row, and so on. (Ref. 1 Bajorski, 2012)

To load the data, run the following code:

The above code uses the EBImage package, which can be installed as described in my previous post.

Why do we need to reduce the dimension of the data?

Before we jump into the analysis, you may ask: why? Sometimes it is simply difficult to analyze high-dimensional data, and especially to interpret it. Some dimensions are not significant (for example, because they are redundant), which complicates the analysis. To deal with this, we remove those nuisance dimensions and work with the significant ones.

To perform PCA in R, we use the function princomp as seen below:

The object returned by princomp is a list, as shown above; we describe selected components below. The others can be found in the documentation of the function by executing ?princomp.

sdev - standard deviation, the square root of the eigenvalues $\lambda$ of the variance-covariance matrix $\mathbf{\Sigma}$ of the data, dat.mat;

loadings - eigenvectors $\mathbf{e}$ of the variance-covariance matrix $\mathbf{\Sigma}$ of the data, dat.mat;

scores - the principal component scores.
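As a minimal sketch of how these components relate to the eigendecomposition, the snippet below runs princomp on synthetic data. The names toy, pca and eig are assumptions made for illustration and stand in for dat.mat, which is not available here. One detail worth knowing: princomp uses the divisor $n$ rather than $n-1$ for the covariance matrix.

```r
# Synthetic stand-in for dat.mat: 200 observations of 5 correlated variables
set.seed(1)
toy <- matrix(rnorm(200 * 5), ncol = 5) %*% matrix(runif(25, -1, 1), ncol = 5)

pca <- princomp(toy)
lambda <- unname(pca$sdev^2)          # squared sdev = eigenvalues

# princomp() uses divisor n (not n - 1), so rescale cov() before comparing
n <- nrow(toy)
eig <- eigen(cov(toy) * (n - 1) / n, symmetric = TRUE)

same_values <- all.equal(lambda, eig$values)
# eigenvectors are only defined up to sign, so compare absolute values
same_vectors <- all.equal(abs(unclass(pca$loadings)), abs(eig$vectors),
                          check.attributes = FALSE)
c(same_values, same_vectors)
```

Both comparisons should agree up to numerical tolerance, which confirms that sdev and loadings are simply the square roots of the eigenvalues and the eigenvectors.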

Recall that the objective of PCA is to find a linear combination $Y=\mathbf{u}^T\mathbf{X}$ that maximizes the variance $Var(Y)$. In the output, the estimates of the components of $\mathbf{u}$ are the entries of loadings, a matrix whose columns are the eigenvectors of the successive principal components. That is, if the first principal component is given by $Y_1=\mathbf{u}_1^T\mathbf{X}$, then the estimate of $\mathbf{u}_1$, namely $\mathbf{e}_1$ (eigenvector), is the set of coefficients in the first column of the loadings. The explained variability of the first principal component is the square of the first standard deviation in sdev, the explained variability of the second principal component is the square of the second standard deviation in sdev, and so on.

Now let's interpret the loadings (coefficients) of the first three principal components. Below is the plot of this,

Based on the plot above, the coefficients of the first principal component (PC1) are almost all negative. On a closer look, the variability in this principal component is mainly explained by the weighted average of radiance of the spectral bands 35 to 100. Analogously, PC2 mainly represents the variability of the weighted average of radiance of spectral bands 1 to 34. Further, the fluctuation of the coefficients of PC3 makes it difficult to tell which bands contribute most to its variability.

Aside from examining the loadings, another way to see the impact of the PCs is through the impact plot, where the impact curves $\sqrt{\lambda_j}\mathbf{e}_j$ are plotted. I leave that for you to explore.

Moving on, let's investigate the percent of variability in $X_i$ explained by the $j$th principal component. The formula is \begin{equation}\nonumber \frac{\lambda_j\cdot e_{ij}^2}{s_{ii}}, \end{equation} where $s_{ii}$ is the estimated variance of $X_i$. Below is the percent of explained variability in $X_i$ for the first three principal components, including the cumulative percent variability (sum of PC1, PC2, and PC3),

For the first 33 bands, PC2 accounts for about 90 percent of the explained variability, as seen in the above plot. It also contributes substantially to bands 102 to 152. On the other hand, for bands 37 to 100, PC1 explains almost all the variability, with PC2 and PC3 explaining only 0 to 1 percent. The sum of the percentages of explained variability of these principal components is shown as the orange line in the above plot, which is the cumulative percent variability.
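The formula $\lambda_j e_{ij}^2/s_{ii}$ can be sketched in a few lines. This is an illustration on synthetic data, not the AVIRIS computation; the names toy, pca and share are assumptions. Because princomp uses the divisor $n$, the variances $s_{ii}$ are put on the same scale.

```r
set.seed(1)
toy <- matrix(rnorm(200 * 5), ncol = 5) %*% matrix(runif(25, -1, 1), ncol = 5)
pca <- princomp(toy)

lambda <- pca$sdev^2                      # eigenvalues lambda_j
E <- unclass(pca$loadings)                # e_ij: row i = variable, column j = PC
n <- nrow(toy)
s_ii <- diag(cov(toy)) * (n - 1) / n      # variances on princomp's divisor-n scale

# share[i, j] = lambda_j * e_ij^2 / s_ii
share <- sweep(E^2, 2, lambda, `*`) / s_ii
round(100 * share[, 1:3], 1)              # percent explained by PC1 to PC3
rowSums(share)                            # over all p PCs each row sums to 1
```

The row sums equal one because $\sum_j \lambda_j e_{ij}^2 = s_{ii}$, so the cumulative curve over all $p$ components would reach 100 percent for every variable.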

To wrap up this section, here is the percentage of the explained variability of the first 10 PCs.

Table 1: Variability Explained by the First Ten Principal Components for the AVIRIS data.

PC      Explained variability (%)
PC1     82.057
PC2     17.176
PC3     0.320
PC4     0.182
PC5     0.094
PC6     0.065
PC7     0.037
PC8     0.029
PC9     0.014
PC10    0.005

The variabilities above were obtained by noting that the variability explained by a principal component is simply the corresponding eigenvalue (square of the sdev) of the variance-covariance matrix $\mathbf{\Sigma}$ of the original variable $\mathbf{X}$. Hence the percentage of variability explained by the $j$th PC is equal to its eigenvalue $\lambda_j$ divided by the overall variability, which is the sum of the eigenvalues, $\sum_{j=1}^{p}\lambda_j$, as we see in the following code,
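A minimal sketch of this computation, run on synthetic data rather than the AVIRIS image (the names toy, pca and pct are assumptions):

```r
set.seed(1)
toy <- matrix(rnorm(200 * 5), ncol = 5) %*% matrix(runif(25, -1, 1), ncol = 5)
pca <- princomp(toy)

lambda <- pca$sdev^2                 # eigenvalues
pct <- 100 * lambda / sum(lambda)    # percent of variability per PC
round(pct, 3)
round(cumsum(pct), 3)                # cumulative percent
```

The same two lines applied to the princomp fit of the real data would give the kind of values listed in Table 1.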

Stopping Rules

Given the list of percentage of variability explained by the PCs in Table 1, how many principal components should we take into account that would best represent the variability of the original data? To answer that, we introduce the following stopping rules that will guide us on deciding the number of PCs:

Scree plot;

Simple fair-share;

Broken-stick; and,

Relative broken-stick.

The scree plot is the plot of the variability of the PCs, that is, the plot of the eigenvalues, in which we look for an elbow or sudden drop of the eigenvalues. For our example we have

Therefore, we need to return the first two principal components based on the elbow shape. However, if the eigenvalues differ by orders of magnitude, it is recommended to use the logarithmic scale, which is illustrated below,

Unfortunately, sometimes this won't work, as we can see here: it's just difficult to determine where the elbow is.

The succeeding discussions on the last three stopping rules are based on (Bajorski, 2012). The simple fair-share stopping rule identifies the largest $k$ such that $\lambda_k$ is larger than its fair share, that is, larger than $(\lambda_1+\lambda_2+\cdots+\lambda_p)/p$. To illustrate this, consider the following:

Thus, we need to stop at the second principal component.
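The simple fair-share rule can be sketched as follows. The eigenvalues here are made up for illustration and are not the AVIRIS ones; the names lambda, fair_share and k_fair are assumptions.

```r
lambda <- c(6, 3, 0.5, 0.3, 0.2)     # made-up eigenvalues, sorted decreasingly
fair_share <- mean(lambda)           # (lambda_1 + ... + lambda_p) / p, here 2
k_fair <- max(which(lambda > fair_share))
k_fair                               # 2: only the first two exceed the fair share
```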

If one is concerned that the above method produces too many principal components, a broken-stick rule can be used. This rule identifies the largest $k$ such that $\lambda_j/(\lambda_1+\lambda_2+\cdots +\lambda_p)>a_j$, for all $j\leq k$, where \begin{equation}\nonumber a_j = \frac{1}{p}\sum_{i=j}^{p}\frac{1}{i},\quad j =1,\cdots, p. \end{equation} Let's try it,
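A possible implementation of the broken-stick rule, shown on made-up eigenvalues rather than the AVIRIS ones. The function name broken_stick is hypothetical.

```r
# Hypothetical helper: number of PCs kept by the broken-stick rule
broken_stick <- function(lambda) {
  p <- length(lambda)
  a <- rev(cumsum(rev(1 / seq_len(p)))) / p   # a_j = (1/p) * sum_{i=j}^p 1/i
  ok <- lambda / sum(lambda) > a
  # largest k such that the inequality holds for every j <= k
  sum(cumprod(ok))
}

lambda <- c(6, 3, 0.5, 0.3, 0.2)   # made-up eigenvalues, sorted decreasingly
k_bs <- broken_stick(lambda)
k_bs                               # 2
```

Using cumprod on the logical vector counts only the leading run of TRUE values, which is what "for all $j\leq k$" requires.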

The result above coincides with the first two stopping rules. The drawback of the simple fair-share and broken-stick rules is that they do not work well when the eigenvalues differ by orders of magnitude. In such a case, we use the relative broken-stick rule, where we analyze $\lambda_j$ as the first eigenvalue in the set $\lambda_j\geq \lambda_{j+1}\geq\cdots\geq\lambda_{p}$, where $j < p$. The dimensionality $k$ is chosen as the largest value such that $\lambda_j/(\lambda_j+\cdots +\lambda_p)>b_j$, for all $j\leq k$, where \begin{equation}\nonumber b_j = \frac{1}{p-j+1}\sum_{i=1}^{p-j+1}\frac{1}{i}. \end{equation} Applying this to the data we have,

According to the numerical output, the first 34 principal components are enough to represent the variability of the original data.
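The relative broken-stick rule can be sketched in the same spirit. The function name relative_broken_stick is hypothetical, and the eigenvalues are made up so that they differ by orders of magnitude; they are not the AVIRIS eigenvalues.

```r
# Hypothetical helper: number of PCs kept by the relative broken-stick rule
relative_broken_stick <- function(lambda) {
  p <- length(lambda)
  j <- seq_len(p - 1)                       # the rule is defined for j < p
  tail_sum <- rev(cumsum(rev(lambda)))[j]   # lambda_j + ... + lambda_p
  m <- p - j + 1                            # number of eigenvalues in the tail
  b <- cumsum(1 / seq_len(p))[m] / m        # b_j = (1/m) * sum_{i=1}^m 1/i
  ok <- lambda[j] / tail_sum > b
  sum(cumprod(ok))                          # leading run of TRUE values
}

lambda <- c(100, 10, 1, 0.1, 0.01)          # made-up, spanning orders of magnitude
k_rbs <- relative_broken_stick(lambda)
k_rbs                                       # 4
```

For these values the plain broken-stick rule would keep only one component, since the second eigenvalue is just 9 percent of the total, whereas the relative rule compares each eigenvalue only with those that follow it.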

Principal Component Scores

The principal component scores are the new data set obtained from the linear combinations $Y_j=\mathbf{e}_j^T(\mathbf{x}-\bar{\mathbf{x}}), j = 1,\cdots, p$. If we use the first three stopping rules, then below are the scores (as images) of PC1 and PC2,

If we base our choice on the relative broken-stick rule, then we return the first 34 PCs, and below are the corresponding scores (as images).
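To make the definition of the scores concrete, here is a small check on synthetic data (the names toy, pca, centered and scores are assumptions): centering the data and multiplying by the loadings should reproduce the scores component returned by princomp.

```r
set.seed(1)
toy <- matrix(rnorm(200 * 5), ncol = 5) %*% matrix(runif(25, -1, 1), ncol = 5)
pca <- princomp(toy)

centered <- sweep(toy, 2, colMeans(toy))        # x - xbar for every observation
scores <- centered %*% unclass(pca$loadings)    # column j holds Y_j = e_j^T (x - xbar)

scores_match <- all.equal(scores, pca$scores, check.attributes = FALSE)
scores_match
```

For an image such as the one used in this post, each column of the score matrix could then be reshaped to the pixel grid and displayed as a picture.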


Residual Analysis

Of course, when doing PCA there are errors to be considered unless one returns all the PCs, but that would not make any sense: why apply PCA if you still take all the dimensions into account? Without going through the theory, the overall error is simply the variability of the excluded components: if the first $j$ PCs are retained, it is the variability explained by the $k$th to $p$th principal components, $k>j$.
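This statement can be checked numerically. The sketch below, on synthetic data with assumed names (toy, pca, approx_k, err), retains the first two PCs, reconstructs the centered data from them, and compares the average squared reconstruction error with the sum of the excluded eigenvalues.

```r
set.seed(1)
toy <- matrix(rnorm(200 * 5), ncol = 5) %*% matrix(runif(25, -1, 1), ncol = 5)
pca <- princomp(toy)
n <- nrow(toy)
lambda <- unname(pca$sdev^2)

k <- 2                                          # number of PCs retained
E_k <- unclass(pca$loadings)[, 1:k]
approx_k <- pca$scores[, 1:k] %*% t(E_k)        # rank-k approximation of centered data
centered <- sweep(toy, 2, colMeans(toy))

err <- sum((centered - approx_k)^2) / n         # average squared error per observation
excluded <- sum(lambda[-(1:k)])                 # variability of the dropped PCs
c(error = err, excluded = excluded)
```

The two numbers agree because princomp computes its eigenvalues with the divisor $n$, which is why the error is also averaged over $n$ here.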

Enough with the theory we recently published; let's take a break and have fun with an application of statistics used in data mining and machine learning: k-means clustering.

k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. (Wikipedia, Ref 1.)

We will apply this method to an image, wherein we group the pixels into k different clusters. Below is the image that we are going to use,
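As a rough sketch of the idea, the snippet below clusters the pixels of a small synthetic RGB array with stats::kmeans and replaces each pixel by its cluster centre. The array img is random noise standing in for a real image, and all names (img, pixels, fit, quantized) are assumptions.

```r
set.seed(1)
h <- 20; w <- 30
img <- array(runif(h * w * 3), dim = c(h, w, 3))   # synthetic RGB image in [0, 1]

pixels <- matrix(img, ncol = 3)     # one row per pixel, columns = R, G, B
k <- 4
fit <- kmeans(pixels, centers = k, nstart = 5)

# replace every pixel by the mean colour of its cluster
quantized <- array(fit$centers[fit$cluster, ], dim = dim(img))
table(fit$cluster)                  # cluster sizes
```

With a real image read into an array of the same shape, plotting as.raster(quantized) with rasterImage would show the picture reduced to k colours.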

Here’s a phrase you never want to see in print (in a legal decision, no less) pertaining to your academic research: “The IRB process, however, was improperly engaged by the Dartmouth researcher and ignored completely by the Stanford researchers.”

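Today I was working on a two-part Procrustes problem and wanted to find out why my minimization algorithm sometimes does not converge properly or renders unexpected results. The loss function to be minimized is

\begin{equation}\nonumber L(Q, c) = \| c A_1 Q - B_1 \|_F^2 + \| A_2 Q - B_2 \|_F^2, \end{equation}

with $\|\cdot\|_F$ denoting the Frobenius norm; $c$ is an unknown scalar and $Q$ an unknown rotation matrix, i.e. $Q^TQ=I$. $A_1$, $B_1$, $A_2$ and $B_2$ are four real-valued matrices. The minimum for $c$ is easily found by setting the partial derivative of $L$ w.r.t. $c$ equal to zero:

\begin{equation}\nonumber c = \frac{tr(Q^T A_1^T B_1)}{\| A_1 \|_F^2}. \end{equation}

By plugging $c$ into the loss function $L(Q, c)$ we get a new loss function $L(Q)$ that only depends on $Q$. This is the starting situation.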
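The closed-form scaling factor can be checked numerically. The sketch below assumes that the part of the loss involving the scalar has the form $\| c A_1 Q - B_1 \|_F^2$, as in the R code of this post, and compares the closed-form value with a one-dimensional numerical minimization. The matrices are random, and all names except tr are assumptions.

```r
set.seed(1)
tr <- function(X) sum(diag(X))
p <- 3
A1 <- matrix(runif(p * p, -1, 1), ncol = p)
B1 <- matrix(runif(p * p, -1, 1), ncol = p)
Q <- qr.Q(qr(matrix(rnorm(p * p), ncol = p)))   # some orthogonal matrix

# the part of the loss that depends on the scaling factor s, for fixed Q
loss1 <- function(s) norm(s * A1 %*% Q - B1, "F")^2

s_closed <- tr(t(Q) %*% t(A1) %*% B1) / norm(A1, "F")^2
s_numeric <- optimize(loss1, interval = c(-10, 10))$minimum
c(closed = s_closed, numeric = s_numeric)
```

The second part of the loss does not involve the scalar, so it plays no role in this step.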

When trying to find out why the algorithm to minimize $L(Q)$ did not work as expected, I got stuck. So I decided to conduct a small simulation and generate random rotation matrices to study the relation between the parameter $c$ and the value of the loss function $L(Q)$. Before looking at the results for the entire two-part Procrustes problem from above, let’s visualize the results for the first part of the loss function only, i.e.

\begin{equation}\nonumber L_1(Q, c) = \| c A_1 Q - B_1 \|_F^2. \end{equation}

Here, $c$ has the same minimum as for the whole formula above. For the simulation I used

as input matrices. Generating many random rotation matrices and plotting $c$ against the value of the loss function yields the following plot.

This is a well-behaved relation: for each value of the scaling parameter the loss is identical. Now let’s look at the full two-part loss function. As input matrices I used

and the following R-code.

# trace function
tr <- function(X) sum(diag(X))

# random matrix type 1: uniform entries
rmat_1 <- function(n=3, p=3, min=-1, max=1) {
  matrix(runif(n*p, min, max), ncol=p)
}

# random matrix type 2, sparse: identity matrix with permuted columns
rmat_2 <- function(p=3) {
  diag(p)[, sample(1:p, p)]
}

# generate random rotation matrix Q. Based on Q find
# optimal scaling factor c and calculate loss function value
# (A1, B1, A2 and B2 are taken from the global environment)
one_sample <- function(n=2, p=2)
{
  Q <- mixAK::rRotationMatrix(n=1, dim=p) %*%        # random rotation matrix det(Q) = 1
    diag(sample(c(-1,1), p, replace=TRUE))           # additional reflections, so det(Q) in {-1,1}
  s <- tr( t(Q) %*% t(A1) %*% B1 ) / norm(A1, "F")^2 # scaling factor c
  rss <- norm(s*A1 %*% Q - B1, "F")^2 +              # get residual sum of squares
    norm(A2 %*% Q - B2, "F")^2
  c(s=s, rss=rss)
}

# find c and rss for many random rotation matrices
set.seed(10) # nice case for 3 x 3
n <- 3
p <- 3
A1 <- round(rmat_1(n, p), 1)
B1 <- round(rmat_1(n, p), 1)
A2 <- rmat_2(p)
B2 <- rmat_2(p)
x <- plyr::rdply(40000, one_sample(3,3))
plot(x$s, x$rss, pch=16, cex=.4, xlab="c", ylab="L(Q)", col="#00000010")

This time the result turns out to be very different and … beautiful :)

Here, we no longer have a one-to-one relation between the scaling parameter and the loss function. I do not quite know what to make of this yet. But for now I am happy that it has aesthetic value. Below you find some more beautiful graphics with different matrices as inputs.

I have used R (and S before it) for a couple of decades. In the last few years most of my coding has been in Julia, a language for technical computing that can provide remarkable performance for a dynamically typed language via Just-In-Time (JIT) compilation of functions and via multiple dispatch.

Nonetheless there are facilities in R that I would like to have access to from Julia. I created the RCall package for Julia to do exactly that. This IJulia notebook provides an introduction to RCall.

This is not a novel idea by any means. Julia already has PyCall and JavaCall packages that provide access to Python and to Java. These packages are used extensively and are much more sophisticated than RCall, at present. Many other languages have facilities to run an embedded instance of R. In fact, Python has several such interfaces.

The things I plan to do using RCall are to access datasets from R and R packages, to fit models that are not currently implemented in Julia and to use R graphics, especially the ggplot2 and lattice packages. Unfortunately I am not currently able to start a graphics device from the embedded R but I expect that to be fixed soon.

I can tell you the most remarkable aspect of RCall although it may not mean much if you haven't tried to do this kind of thing. It is written entirely in Julia. There is absolutely no "glue" code written in a compiled language like C or C++. As I said, this may not mean much to you unless you have tried to do something like this, in which case it is astonishing.

We teach two software packages, R and SPSS, in Quantitative Methods 101 for psychology freshmen at Bremen University (Germany). Sometimes confusion arises when the software packages produce different results. This may be due to specifics in the implementation of a method or, as in most cases, to different default settings. One of these situations occurs when the QQ-plot is introduced. Below we see two QQ-plots, produced by SPSS and R, respectively. The data used in the plots were generated by:

set.seed(0)
x <- sample(0:9, 100, rep=T)

SPSS

R

qqnorm(x, datax=T) # uses Blom's method by default
qqline(x, datax=T)

There are some obvious differences:

The most obvious one is that the R plot seems to contain more data points than the SPSS plot. Actually, this is not the case. Some data points are plotted on top of each other in SPSS while they are spread out vertically in the R plot. The reason for this difference is that SPSS uses a different approach to assigning probabilities to the values. We will explore the two approaches below.

The scaling of the y-axis differs. R uses quantiles from the standard normal distribution. SPSS by default rescales these values using the mean and standard deviation from the original data. This allows us to directly compare the original and theoretical values. This is a simple linear transformation and will not be explained any further here.

The QQ-lines are not identical. R uses the 1st and 3rd quartile from both distributions to draw the line. This is different in SPSS, where a line is drawn for identical values on both axes. We will explore the differences below.

QQ-plots from scratch

To get a better understanding of the difference we will build the R and SPSS-flavored QQ-plot from scratch.

R type

In order to calculate theoretical quantiles corresponding to the observed values, we first need to find a way to assign a probability to each value of the original data. A lot of different approaches exist for this purpose (for an overview see e.g. Castillo-Gutiérrez, Lozano-Aguilera, & Estudillo-Martínez, 2012b). They usually build on the ranks of the observed data points to calculate corresponding p-values, i.e. the plotting positions for each point. The qqnorm function uses two formulae for this purpose, depending on the number of observations $n$ (Blom’s method, see ?qqnorm; Blom, 1958). With $r$ being the rank, for $n > 10$ it will use the formula $p = (r - 1/2)/n$, for $n \leq 10$ the formula $p = (r - 3/8)/(n + 1/4)$ to determine the probability value $p$ for each observation (see the help files for the functions qqnorm and ppoints). For simplicity reasons, we will only implement the $n > 10$ case here.

n <- length(x) # number of observations
r <- order(order(x)) # order of values, i.e. ranks without averaged ties
p <- (r - 1/2) / n # assign probabilities to ranks using Blom's method (case n > 10)
y <- qnorm(p) # theoretical standard normal quantiles for p values
plot(x, y) # plot empirical against theoretical values
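As a quick check of the plotting positions, the sketch below compares both cases with ppoints, the function qqnorm uses internally, and confirms that the hand-built theoretical quantiles equal those returned by qqnorm. It assumes the same x as generated at the top of this post; the names ending in _ok are assumptions.

```r
set.seed(0)
x <- sample(0:9, 100, replace = TRUE)
n <- length(x)
r <- order(order(x))                  # ordinal ranks

# plotting positions for the two cases handled by ppoints()
small_ok <- all.equal(ppoints(5), (1:5 - 3/8) / (5 + 1/4))   # n <= 10
large_ok <- all.equal(ppoints(n), (1:n - 1/2) / n)           # n > 10

# theoretical quantiles built by hand versus those computed by qqnorm()
qq <- qqnorm(x, plot.it = FALSE)      # qq$x holds the theoretical quantiles
quant_ok <- all.equal(qq$x, qnorm((r - 1/2) / n))
c(small_ok, large_ok, quant_ok)
```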

Before we take a look at the code, note that our plot is identical to the plot generated by qqnorm above, except that the QQ-line is missing. The main point that makes the difference between R and SPSS is found in the command order(order(x)). The command calculates ranks for the observations using ordinal ranking. This means that all observations get different ranks and no average ranks are calculated for ties, i.e. for observations with equal values. Another approach would be to apply fractional ranking and calculate average values for ties. This is what the function rank does. The following code shows the difference between the two approaches to assign ranks.

v <- c(1,1,2,3,3)
order(order(v)) # ordinal ranking used by R

## [1] 1 2 3 4 5

rank(v) # fractional ranking used by SPSS

## [1] 1.5 1.5 3.0 4.5 4.5

R uses ordinal ranking and SPSS uses fractional ranking by default to assign ranks to values. Thus, the positions do not overlap in R as each ordered observation is assigned a different rank and therefore a different p-value. We will pick up the second approach again later, when we reproduce the SPSS-flavored plot in R.^{1}

The second difference between the plots concerned the scaling of the y-axis and was already clarified above.

The last point to understand is how the QQ-line is drawn in R. Looking at the probs argument of qqline reveals that it uses the 1st and 3rd quartile of the original data and theoretical distribution to determine the reference points for the line. We will draw the line between the quartiles in red and overlay it with the line produced by qqline to see if our code is correct.

plot(x, y) # plot empirical against theoretical values
ps <- c(.25, .75) # reference probabilities
a <- quantile(x, ps) # empirical quantiles
b <- qnorm(ps) # theoretical quantiles
lines(a, b, lwd=4, col="red") # our QQ line in red
qqline(x, datax=T) # R QQ line

The reason for different lines in R and SPSS is that several approaches to fitting a straight line exist (for an overview see e.g. Castillo-Gutiérrez, Lozano-Aguilera, & Estudillo-Martínez, 2012a). Each approach has different advantages. The method used by R is more robust when we expect values to diverge from normality in the tails, and we are primarily interested in the normality of the middle range of our data. In other words, the method of fitting an adequate QQ-line depends on the purpose of the plot. An explanation of the rationale of the R approach can e.g. be found here.

SPSS type

The default SPSS approach also uses Blom’s method to assign probabilities to ranks (you may choose other methods in SPSS) and differs from the one above in the following aspects:

a) As already mentioned, SPSS uses ranks with averaged ties (fractional rankings) not the plain order ranks (ordinal ranking) as in R to derive the corresponding probabilities for each data point. The rest of the code is identical to the one above, though I am not sure if SPSS distinguishes between the $n > 10$ and $n \leq 10$ cases.

b) The theoretical quantiles are scaled to match the estimated mean and standard deviation of the original data.

c) The QQ-line goes through all quantiles with identical values on the x and y axis.

n <- length(x) # number of observations
r <- rank(x) # a) ranks using fractional ranking (averaging ties)
p <- (r - 1/2) / n # assign probabilities to ranks using Blom's method (case n > 10)
y <- qnorm(p) # theoretical standard normal quantiles for p values
y <- y * sd(x) + mean(x) # b) transform SND quantiles to mean and sd from original data
plot(x, y) # plot empirical against theoretical values

Lastly, let us add the line. As the scaling of both axes is the same, the line goes through the origin with a slope of 1.

abline(0,1) # c) intercept 0 and slope 1, i.e. a line through the origin

The comparison to the SPSS output shows that they are (visually) identical.

Function for SPSS-type QQ-plot

The whole point of this demonstration was to pinpoint and explain the differences between a QQ-plot generated in R and SPSS, so it will no longer be a reason for confusion. Note, however, that SPSS offers a whole range of options to generate the plot. For example, you can select the method to assign probabilities to ranks and decide how to treat ties. The plots above used the default setting (Blom’s method and averaging across ties). Personally I like the SPSS version. That is why I implemented the function qqnorm_spss in the ryouready package, which accompanies the course. The formulae for the different methods to assign probabilities to ranks can be found in Castillo-Gutiérrez et al. (2012b). The implementation is a preliminary version that has not yet been thoroughly tested. You can find the code here. Please report any bugs or suggestions for improvements (which are very welcome) in the github issues section.

library(devtools)
install_github("markheckmann/ryouready") # install from github repo
library(ryouready) # load package
library(ggplot2)
qq <- qqnorm_spss(x, method=1, ties.method="average") # Blom's method with averaged ties
plot(qq) # generate QQ-plot
ggplot(qq) # use ggplot2 to generate QQ-plot

Literature

Blom, G. (1958). Statistical Estimates and Transformed Beta Variables. Wiley.

Technical sidenote: Internally, qqnorm uses the function ppoints to generate the p-values. Type in stats:::qqnorm.default to the console to have a look at the code. ↩

Update: The links to all my github gists on blogger are broken, and I can't figure out how to fix them. If you know how to insert github gists on a dynamic blogger template, please let me know.

In the meantime, here are instructions with links to the code: First of all, use homebrew to compile openblas. It's easy! Second of all, you can also use homebrew to install R! (But maybe stick with the CRAN version unless you really want to compile your own R binary)

Inspired by this post, I decided to try using OpenBLAS for R on my Mac. However, it turns out there's a simpler option, using the vecLib BLAS library, which is provided by Apple as part of the Accelerate framework.

If you are using R 2.15, follow these instructions to change your BLAS from the default to vecLib:

However, as noted in r-sig-mac, these instructions do not work for R 3.0. You have to directly link to the accelerate framework's version of vecLib:

Finally, test your new blas using this script:

On my system (a retina macbook pro), the default BLAS takes 141 seconds and vecLib takes 43 seconds, which is a significant speedup. If you plan to use vecLib, note the following warning from the R development team "Although fast, it is not under our control and may possibly deliver inaccurate results."

So far, I have not encountered any issues using vecLib, but it's only been a few hours :-).

If you do this, make sure to change the directories to point to the correct location on your system (e.g. change /users/zach/source to whatever directory you clone the git repo into). On my system, the benchmark script takes ~41 seconds when using openBLAS, which is a small but significant speedup.

Assuming that the “no” vote prevails in the Scottish independence referendum, the next question for the United Kingdom is to consider constitutional reform to implement a quasi-federal system and resolve the West Lothian question once and for all. In some ways, it may also provide an opportunity to resolve the stalled reform of the upper house as well. Here’s the rough outline of a proposal that might work.

Devolve identical powers to England, Northern Ireland, Scotland, and Wales, with the proviso that local self-rule can be suspended if necessary by the federal legislature (by a supermajority).

The existing House of Commons becomes the House of Commons for England, which (along with the Sovereign) shall comprise the English Parliament. This parliament would function much as the existing devolved legislatures in Scotland and Wales; the consociational structure of the Northern Ireland Assembly (requiring double majorities) would not be replicated.

The House of Lords is abolished, and replaced with a directly-elected Senate of the United Kingdom. The Senate will have authority to legislate on the non-devolved powers (in American parlance, “delegated” powers) such as foreign and European Union affairs, trade and commerce, national defense, and on matters involving Crown dependencies and territories, the authority to legislate on devolved matters in the event self-government is suspended in a constituent country, and dilatory powers including a qualified veto (requiring a supermajority) over the legislation proposed by a constituent country’s parliament. The latter power would effectively replace the review powers of the existing House of Lords; it would function much as the Council of Revision in Madison’s original plan for the U.S. Constitution.

As the Senate will have relatively limited powers, it need not be as large as the existing Lords or Commons. To ensure the countries other than England have a meaningful voice, given that nearly 85% of the UK’s population is in England, two-thirds of the seats would be allocated proportionally based on population and one-third allocated equally to the four constituent countries. This would still result in a chamber with a large English majority (around 64.4%) but nonetheless would ensure the other three countries would have meaningful representation as well.

I wanted to reproduce a similar figure in R using pictograms and additionally color them, e.g. by group membership. I have almost no knowledge about image processing, so I tried out several methods of how to achieve what I want. The first thing I did was read in a PNG file and look at the data structure. The package png allows you to read in PNG files. Note that all of the below may not work on Windows machines, as it does not support semi-transparency (see ?readPNG).

The object is a numerical array with four layers (red, green, blue, alpha; short RGBA). Let’s have a look at the first layer (red) and replace all non-zero entries by a one and the zeros by a dot. This will show us the pattern of non-zero values and we already see the contours.
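The ones-and-dots view can be sketched without any image file. The array below is a tiny synthetic RGBA image standing in for the result of readPNG; the names img, red, pattern and rows are assumptions.

```r
# synthetic 6 x 8 RGBA array with a filled rectangle in the red layer
img <- array(0, dim = c(6, 8, 4))
img[2:5, 3:6, 1] <- 0.8

red <- img[, , 1]                          # first layer (red)
pattern <- ifelse(red != 0, "1", ".")      # non-zero entries -> "1", zeros -> "."
rows <- apply(pattern, 1, paste, collapse = "")
cat(rows, sep = "\n")                      # one text line per image row
```

Applied to the first layer of a real image array, the same three lines print the contour of the picture as text.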

To display the image in R one way is to raster the image (i.e. the RGBA layers are collapsed into a layer of single HEX value) and print it using rasterImage.

Now we have an idea of what the image object and the rastered object look like from the inside. Let’s start to modify the images to suit our needs.

In order to change the color of the pictograms, my first idea was to convert the graphics to greyscale and remap the values to a color ramp of my choice. To convert to greyscale there are tons of methods around (see e.g. here). I just picked one of them that I found on SO by chance. With R=Red, G=Green and B=Blue we have

Okay, that basically does the job. Now we will apply it to the wine pictograms.
Let’s use this wine glass from Wikimedia Commons. It’s quite big, so I uploaded a reduced-size version to imgur. We will use it for our purposes.

# load file from web
library(png)
f <- tempfile()
download.file("http://i.imgur.com/A14ntCt.png", f)
img <- readPNG(f)
img <- as.raster(img)
r <- nrow(img) / ncol(img) # size ratio (height / width)
s <- 1                     # size
# let's create a function that returns a ramp function to save typing
ramp <- function(colors)
  function(x) rgb(colorRamp(colors)(x), maxColorValue = 255)
# create dataframe with coordinates and colors
set.seed(1)
x <- data.frame(x=rnorm(16, c(2,2,4,4)),
                y=rnorm(16, c(1,3)),
                colors=c("black", "darkred", "darkgreen", "darkblue"),
                stringsAsFactors=FALSE) # keep color names as character strings
plot(c(1,6), c(0,5), type="n", xlab="", ylab="", asp=1)
for (i in 1L:nrow(x)) {
  colorramp <- ramp(c(x[i,3], "white"))
  img2 <- img_to_colorramp(img, colorramp)
  rasterImage(img2, x[i,1], x[i,2], x[i,1]+s/r, x[i,2]+s)
}

Another approach would be to modify the RGB layers before rastering to HEX values.

img <- readPNG(system.file("img", "Rlogo.png", package="png"))
img2 <- img
img[,,1] <- 0 # remove Red component
img[,,2] <- 0 # remove Green component
img[,,3] <- 1 # Set Blue to max
img <- as.raster(img)
r <- nrow(img) / ncol(img) # size ratio
s <- 3.5 # size
plot(c(0,10), c(0,3.5), type = "n", xlab = "", ylab = "", asp=1)
rasterImage(img, 0, 0, 0+s/r, 0+s)
img2[,,1] <- 1 # Red to max
img2[,,2] <- 0
img2[,,3] <- 0
rasterImage(as.raster(img2), 5, 0, 5+s/r, 0+s)

To just colorize the image, we could weight each layer.

# wrap weighting into function
weight_layers <- function(img, w) {
  for (i in seq_along(w))
    img[,,i] <- img[,,i] * w[i]
  img
}
plot(c(0,10), c(0,3.5), type = "n", xlab = "", ylab = "", asp=1)
img <- readPNG(system.file("img", "Rlogo.png", package="png"))
img2 <- weight_layers(img, c(.2, 1,.2))
rasterImage(img2, 0, 0, 0+s/r, 0+s)
img3 <- weight_layers(img, c(1,0,0))
rasterImage(img3, 5, 0, 5+s/r, 0+s)

After playing around and hard-coding the modifications I started to search and found the EBImage package, which has a lot of features for image processing that make one's life (in this case only a bit) easier.

library(EBImage)
f <- system.file("img", "Rlogo.png", package="png")
img <- readImage(f)
img2 <- img
img[,,2] = 0 # zero out green layer
img[,,3] = 0 # zero out blue layer
img <- as.raster(img)
img2[,,1] = 0
img2[,,3] = 0
img2 <- as.raster(img2)
r <- nrow(img) / ncol(img)
s <- 3.5
plot(c(0,10), c(0,3.5), type = "n", xlab = "", ylab = "", asp=1)
rasterImage(img, 0, 0, 0+s/r, 0+s)
rasterImage(img2, 5, 0, 5+s/r, 0+s)

EBImage is a good choice and fairly easy to handle. Now let’s again print the pictograms.

f <- tempfile(fileext=".png")
download.file("http://i.imgur.com/A14ntCt.png", f)
img <- readImage(f)
# will replace whole image layers by one value
# only makes sense if there is a alpha layer that
# gives the contours
#
mod_color <- function(img, col) {
  v <- col2rgb(col) / 255
  img = channel(img, 'rgb')
  img[,,1] = v[1] # Red
  img[,,2] = v[2] # Green
  img[,,3] = v[3] # Blue
  as.raster(img)
}
r <- nrow(img) / ncol(img) # get image ratio
s <- 1 # size
# create random data
set.seed(1)
x <- data.frame(x=rnorm(16, c(2,2,4,4)),
y=rnorm(16, c(1,3)),
colors=1:4)
# plot pictograms
plot(c(1,6), c(0,5), type="n", xlab="", ylab="", asp=1)
for (i in 1L:nrow(x)) {
  img2 <- mod_color(img, x[i, 3])
  rasterImage(img2, x[i,1], x[i,2], x[i,1]+s*r, x[i,2]+s)
}

Note, that above I did not bother to center each pictogram to position it correctly. This still needs to be done. Anyway, that’s it! Mission completed.


Kevin Drum asks a bunch of questions about soccer:

Outside the penalty area there’s a hemisphere about 20 yards wide. I can’t recall ever seeing it used for anything. What’s it for?

On several occasions, I’ve noticed that if the ball goes out of bounds at the end of stoppage time, the referee doesn’t whistle the match over. Instead, he waits for the throw-in, and then immediately whistles the match over. What’s the point of this?

Speaking of stoppage time, how has it managed to last through the years? I know, I know: tradition. But seriously. Having a timekeeper who stops the clock for goals, free kicks, etc. has lots of upside and no downside. Right? It wouldn’t change the game in any way, it would just make timekeeping more accurate, more consistent, and more transparent for the fans and players. Why keep up the current pretense?

What’s the best way to get a better sense of what’s a foul and what’s a legal tackle? Obviously you can’t tell from the players’ reactions, since they all writhe around like landed fish if they so much as trip over their own shoelaces. Reading the rules provides the basics, but doesn’t really help a newbie very much. Maybe a video that shows a lot of different tackles and explains why each one is legal, not legal, bookable, etc.?

The first one’s easy: there’s a general rule that no defensive player can be within 10 yards of the spot of a direct free kick. A penalty kick (which is a type of direct free kick) takes place in the 18-yard box, and no players other than the player taking the kick and the goalkeeper are allowed in the box. However, owing to geometry, the 18 yard box and the 10 yard exclusion zone don’t fully coincide, hence the penalty arc. (That’s also why there are two tiny hash-marks on the goal line and side line 10 yards from the corner flag. And why now referees have a can of shaving cream to mark the 10 yards for other free kicks, one of the few MLS innovations that has been a good idea.)

Second one’s also easy: the half and the game cannot end while the ball is out of play.

Third one’s harder. First, keeping time inexactly forestalls the silly premature celebrations that are common in most US sports. You’d never see the Stanford-Cal play happen in a soccer game. Second, it allows some slippage for short delays and doesn’t require exact timekeeping; granted, this was more valuable before instant replays and fourth officials, but most US sports require a lot of administrative record-keeping by ancillary officials. A soccer game can be played with one official (and often is, particularly at the amateur level) without having to change timing rules;* in developing countries in particular this lowers the barriers to entry for the sport (along with the low equipment requirements) without changing the nature of the game appreciably. Perhaps most importantly, if the clock was allowed to stop regularly it would create an excuse for commercial timeouts and advertising breaks, which would interrupt the flow of the game and potentially reduce the advantages of better-conditioned and more skilled athletes. (MLS tried this, along with other exciting American ideas like “no tied games,” and it was as appealing to actual soccer fans as ketchup on filet mignon would be to a foodie, and perhaps more importantly didn’t make any non-soccer fans watch.)

Fourth, the key distinction is usually whether there was an obvious attempt to play the ball; in addition, in the modern game, even some attempts to play the ball are considered inherently dangerous (tackling from behind, many sliding tackles, etc.) and therefore are fouls even if they are successful in getting more ball than human.

* To call offside, you’d also probably need what in my day we called a “linesman.”

Probably the worst-kept non-secret is that the next stage of the institutional evolution of my current employer is to some ill-defined concept of “university status,” which mostly involves the establishment of some to-be-determined master’s degree programs. In the context of the University System of Georgia, it means a small jump from the “state college” prestige tier (a motley collection of schools that largely started out as two-year community colleges and transfer institutions) to the “state university” tier (which is where most of the ex-normal schools hang out these days). What is yet to be determined is how that transition will affect the broader institution that will be the University of Middle Georgia.* People on high are said to be working on these things; in any event, here are my assorted random thoughts on what might be reasonable things to pursue:

Marketing and positioning: Unlike the situation facing many of the other USG institutions, the population of the two anchor counties of our core service area (Bibb and Houston) is growing, and Houston County in particular has a statewide reputation for the quality of its public school system. Rather than conceding that the most prepared students from these schools will go to Athens or Atlanta or Valdosta, we should strongly market our institutional advantages over these more “prestigious” institutions, particularly in terms of the student experience in the first two years and the core curriculum: we have no large lecture courses, no teaching assistants, no lengthy bus rides to and from class every day, and the vast majority of the core is taught by full-time faculty with terminal degrees. Not to mention costs to students are much lower, particularly in the case of students who do not qualify for need-based aid. Even if we were to “lose” these students as transfers to the top-tier institutions after 1–4 semesters, we’d still benefit from the tuition and fees they bring in and we would not be penalized in the upcoming state performance funding formula. Dual enrollment in Warner Robins in particular is an opportunity to showcase our institution as a real alternative for better prepared students rather than a safety school.

Comprehensive offerings at the bachelor’s level: As a state university, we will need to offer a comprehensive range of options for bachelor’s students to attract and retain students, both traditional and nontraditional. In particular, B.S. degrees in political science and sociology with emphasis in applied empirical skills would meet public and private employer demand for workers who have research skills and the ability to collect, manage, understand, and use data appropriately. There are other gaps in the liberal arts and sciences as well that need to be addressed to become a truly comprehensive state university.

Create incentives to boost the residential population: The college currently has a heavy debt burden inherited from the overbuilding of dorms at the Cochran campus. We need to identify ways to encourage students to live in Cochran, which may require public-private partnerships to try to build a “college town” atmosphere in the community near campus. We also need to work with wireless providers like Sprint and T-Mobile to ensure that students from the “big city” can fully use their cell phones and tablets in Cochran and Eastman without roaming fees or changing wireless providers.

Tie the institution more closely to the communities we serve: This includes both physical ties and psychological ties. The Macon campus in particular has poor physical links to the city itself for students who might walk or ride bicycles; extending the existing bike/walking trail from Wesleyan to the Macon campus should be a priority, as should pedestrian access and bike facilities along Columbus Road. Access to the Warner Robins campus is somewhat better but still could be improved. More generally, the institution is perceived as an afterthought or alternative of last resort in the community. Improving this situation and perception among community leaders and political figures may require a physical presence in or near downtown Macon, perhaps in partnership with the GCSU Graduate Center.

* There is no official name-in-waiting, but given that our former interim president seemed to believe he could will this name into existence by repeating it enough I’ll stick with it. The straw poll of faculty trivia night suggests that it’s the least bad option available, which inevitably means the regents will choose something else instead (if the last name change is anything to go by).

I've been putting off sharing this idea because I've
heard the rumors about what happens to folks who aren't security
experts when they post about security on the internet. If this blog is
replaced with cat photos and rainbows, you'll know what happened.

The Sad Truth

It's 2014 and chances are you have accounts on websites that are not
properly handling user passwords. I did no research to produce the
following list of ways passwords are mishandled in decreasing order of
frequency:

Site uses a fast hashing algorithm, typically SHA1(salt + plain-password).

Site doesn't salt password hashes

Site stores raw passwords

We know that sites should be generating secure random salts and using
an established slow hashing algorithm (bcrypt, scrypt, or PBKDF2). Why
are sites not doing this?

While security issues deserve a top spot on any site's priority list,
new features often trump addressing legacy security concerns. The
immediacy of the risk is hard to quantify and it's easy to fall prey
to a "nothing bad has happened yet, why should we change now"
attitude. It's easy for other bugs, features, or performance issues to
win out when measured by immediate impact. Fixing security or other
"legacy" issues is the Right Thing To Do and often you will see no
measurable benefit from the investment. It's like having
insurance. You don't need it until you do.

Specific to the improper storage of user password data is the issue of
the impact to a site imposed by upgrading. There are two common
approaches to upgrading password storage. You can switch cold turkey
to the improved algorithms and force password resets on all of your
users. Alternatively, you can migrate incrementally such that new
users and any user who changes their password gets the increased
security.

The cold turkey approach is not a great user experience and sites
might choose to delay an upgrade to avoid admitting to a weak
security implementation and disrupting their site by forcing password
resets.

The incremental approach is more appealing, but the security benefit
is drastically diminished for any site with a substantial set of
existing users.

Given the above migration choices, perhaps it's (slightly) less
surprising that businesses choose to prioritize other work ahead of
fixing poorly stored user password data.

The Idea

What if you could upgrade a site so that both new and existing users
immediately benefited from the increased security, but without the
disruption of password resets? It turns out that you can and it isn't
very hard.

Consider a user table with columns:

userid
salt
hashed_pass

Where the hashed_pass column is computed using a weak fast
algorithm, for example SHA1(salt + plain_pass).

The core of the idea is to apply a proper algorithm on top of the data
we already have. I'll use bcrypt to make the discussion
concrete. Add columns to the user table as follows:

userid
salt
hashed_pass
hash_type
salt2

Process the existing user table by computing bcrypt(salt2 +
hashed_pass) and storing the result in the hashed_pass column
(overwriting the less secure value); save the new salt value to
salt2 and set hash_type to bcrypt+sha1.

To verify a user where hash_type is bcrypt+sha1, compute
bcrypt(salt2 + SHA1(salt + plain_pass)) and compare to the
hashed_pass value. Note that bcrypt implementations encode the salt
as a prefix of the hashed value so you could avoid the salt2 column,
but it makes the idea easier to explain to have it there.
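To make the wrapping step concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not a vetted implementation: PBKDF2 from the standard library stands in for bcrypt (so the salt is passed as a separate argument instead of being concatenated), the iteration count is arbitrary, the SHA1 digest is hex-encoded before wrapping, and rows are plain dicts rather than a real user table.

```python
import hashlib
import hmac
import os

def slow_hash(salt: bytes, data: bytes) -> bytes:
    # Stand-in for bcrypt (assumption): PBKDF2-HMAC-SHA256 from the standard library.
    return hashlib.pbkdf2_hmac("sha256", data, salt, 100_000)

def legacy_hash(salt: bytes, plain_pass: str) -> bytes:
    # The weak scheme the site already stores: SHA1(salt + plain_pass), hex-encoded.
    digest = hashlib.sha1(salt + plain_pass.encode("utf-8")).hexdigest()
    return digest.encode("ascii")

def migrate_row(row: dict) -> dict:
    # Wrap the stored legacy hash; the plain password is never needed.
    salt2 = os.urandom(16)
    return {
        "userid": row["userid"],
        "salt": row["salt"],
        "salt2": salt2,
        "hashed_pass": slow_hash(salt2, row["hashed_pass"]),
        "hash_type": "bcrypt+sha1",
    }

def verify(row: dict, plain_pass: str) -> bool:
    # Recompute slow_hash(salt2, SHA1(salt + plain_pass)) and compare in constant time.
    candidate = slow_hash(row["salt2"], legacy_hash(row["salt"], plain_pass))
    return hmac.compare_digest(candidate, row["hashed_pass"])

# A made-up legacy row, then the same row after the offline migration.
salt = os.urandom(16)
old_row = {"userid": 1, "salt": salt, "hashed_pass": legacy_hash(salt, "hunter2")}
new_row = migrate_row(old_row)
print(verify(new_row, "hunter2"), verify(new_row, "wrong"))
```

The point of the sketch is that migrate_row only ever touches data that is already in the table, which is why it can run over every existing user at once.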

You can take this approach further and have any user that logs in (as
well as new users) upgrade to a "clean" bcrypt only algorithm since
you can now support different verification algorithms using
hash_type. With the proper application code changes in place, the
upgrade can be done live.
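The upgrade-on-login step could look something like the following sketch. Same caveats as before: PBKDF2 is only a stand-in for bcrypt, the row is a plain dict, and the function names are made up for illustration.

```python
import hashlib
import hmac
import os

def slow_hash(salt: bytes, data: bytes) -> bytes:
    # Stand-in for bcrypt (assumption): PBKDF2-HMAC-SHA256 from the standard library.
    return hashlib.pbkdf2_hmac("sha256", data, salt, 100_000)

def sha1_hex(salt: bytes, plain_pass: str) -> bytes:
    return hashlib.sha1(salt + plain_pass.encode("utf-8")).hexdigest().encode("ascii")

def expected_hash(row: dict, plain_pass: str) -> bytes:
    # hash_type selects the verification algorithm.
    if row["hash_type"] == "bcrypt+sha1":
        return slow_hash(row["salt2"], sha1_hex(row["salt"], plain_pass))
    if row["hash_type"] == "bcrypt":
        return slow_hash(row["salt2"], plain_pass.encode("utf-8"))
    raise ValueError("unknown hash_type: %r" % (row["hash_type"],))

def login(row: dict, plain_pass: str) -> bool:
    if not hmac.compare_digest(expected_hash(row, plain_pass), row["hashed_pass"]):
        return False
    if row["hash_type"] != "bcrypt":
        # The plain password is in hand right now, so re-hash it directly
        # and drop the SHA1 layer (the row is updated in place).
        row["salt2"] = os.urandom(16)
        row["hashed_pass"] = slow_hash(row["salt2"], plain_pass.encode("utf-8"))
        row["hash_type"] = "bcrypt"
    return True

# A made-up row as it might look right after the wrapping migration.
salt, salt2 = os.urandom(16), os.urandom(16)
row = {
    "userid": 1,
    "salt": salt,
    "salt2": salt2,
    "hash_type": "bcrypt+sha1",
    "hashed_pass": slow_hash(salt2, sha1_hex(salt, "hunter2")),
}
```

A failed login leaves the row untouched; the first successful one moves the user to the "clean" scheme.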

This scheme will also work for sites storing non-salted password
hashes as well as those storing plain text passwords (THE HORROR).

Less Sadness, Maybe

Perhaps this approach makes implementing a password storage security
upgrade more palatable and more likely to be prioritized. And if
there's a horrible flaw in this approach, maybe you'll let me know
without turning this blog into a tangle of cat photos and rainbows.

If you use rebar to generate an OTP release project and want to
have reproducible builds, you need the rebar_lock_deps_plugin
plugin. The plugin provides a lock-deps command that will generate a
rebar.config.lock file containing the complete flattened set of
project dependencies each pegged to a git SHA. The lock file acts
similarly to Bundler's Gemfile.lock file and allows for reproducible
builds (*).

Without lock-deps you might rely on the discipline of using a tag
for all of your application's deps. This is insufficient if any dep
depends on something not specified as a tag. It can also be a problem
if a third party dep doesn't provide a tag. Generating a
rebar.config.lock file solves these issues. Moreover, using
lock-deps can simplify the work of putting together a release
consisting of many of your own repos. If you treat the master branch
as shippable, then rather than tagging each subproject and updating
rebar.config throughout your project's dependency chain, you can
run get-deps (without the lock file), compile, and re-lock at the
latest versions throughout your project repositories.

The reproducibility of builds when using lock-deps depends on the
SHAs captured in rebar.config.lock. The plugin works by scanning the
cloned repos in your project's deps directory and extracting the
current commit SHA. This works great until a repository's history is
rewritten with a force push. If you really want reproducible builds,
you need to not nuke your SHAs and you'll need to fork all third party
repos to ensure that someone else doesn't screw you over in this
fashion either. If you make a habit of only depending on third party
repos using a tag, assume that upstream maintainers are not completely
bat shit crazy, and don't force push your master branch, then you'll
probably be fine.

Getting Started

Install the plugin in your project by adding it to your rebar.config
file, then fetch the dependencies, compile, and generate the lock file:

rebar get-deps
# the plugin has to be compiled so you can use it
rebar compile
rebar lock-deps

If you'd like to see a project that uses the plugin, take a
look at Chef's erchef project.

Bonus features

If you are building an OTP release project using rebar generate then
you can use rebar_lock_deps_plugin to enhance your build experience
in three easy steps.

Use rebar bump-rel-version version=$BUMP to automate the process
of editing rel/reltool.config to update the release version. The
argument $BUMP can be major, minor, or patch (default) to
increment the specified part of a semver X.Y.Z version. If
$BUMP is any other value, it is used as the new version
verbatim. Note that this function rewrites rel/reltool.config
using ~p. I check-in the reformatted version and maintain the
formatting when editing. This way, the general case of a version
bump via bump-rel-version results in a minimal diff.

Autogenerate a change summary commit message for all project
deps. Assuming you've generated a new lock file and bumped the
release version, use rebar commit-release to commit the changes
to rebar.config.lock and rel/reltool.config with a commit
message that summarizes the changes made to each dependency between
the previously locked version and the newly locked version. You can
get a preview of the commit message via rebar log-changed-deps.

Finally, create an annotated tag for your new release with rebar
tag-release which will read the current version from
rel/reltool.config and create an annotated tag named with the
version.
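The version arithmetic behind a bump is simple enough to sketch. This is a Python illustration of the semantics described above, not the plugin's Erlang code; resetting the lower parts to zero on a major or minor bump is the usual semver convention and is an assumption here, as is the function name.

```python
def bump_version(current: str, bump: str = "patch") -> str:
    """Return the next version for a semver-style X.Y.Z string."""
    if bump not in ("major", "minor", "patch"):
        # Any other value is used as the new version verbatim.
        return bump
    major, minor, patch = (int(part) for part in current.split("."))
    if bump == "major":
        return "%d.0.0" % (major + 1)   # assumed: lower parts reset to zero
    if bump == "minor":
        return "%d.%d.0" % (major, minor + 1)
    return "%d.%d.%d" % (major, minor, patch + 1)

print(bump_version("1.4.2"))            # default is a patch bump
print(bump_version("1.4.2", "minor"))
print(bump_version("1.4.2", "2.0.0-rc1"))
```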

The dependencies, they are ordered

Up to version 2.0.1 of rebar_lock_deps_plugin, the dependencies in
the generated lock file were ordered alphabetically. This was a
side-effect of using filelib:wildcard/1 to list the dependencies in
the top-level deps directory. In most cases, the order of the full
dependency set does not matter. However, if some of the code in your
project uses parse transforms, then it will be important for the parse
transform to be compiled and on the code path before attempting to
compile code that uses the parse transform.

This issue was recently discovered by a colleague who ran into build
issues using the lock file for a project that had recently integrated
lager for logging. He came up with the idea of maintaining the
order of deps as they appear in the various rebar.config files along
with a prototype patch proving out the idea. As of
rebar_lock_deps_plugin 3.0.0, the lock-deps command will (mostly)
maintain the relative order of dependencies as found in the
rebar.config files.

The "mostly" is that when a dep is shared across two subprojects, it
will appear in the expected order for the first subproject (based on
the ordering of the two subprojects). The deps for the second
subproject will not be in strict rebar.config order, but the
resulting order should address any compile-time dependencies and be
relatively stable (only changing when project deps alter their deps
with larger impact when shared deps are introduced or removed).

Digression: fun with dependencies

There are times, as a programmer, when a real-world problem looks like
a text book exercise (or an interview whiteboard question). Just the
other day at work we had to design some manhole covers, but I digress.

Fixing the order of the dependencies in the generated lock file is
(nearly) the same as finding an install order for a set of projects
with inter-dependencies. I had some fun coding up the text book
solution even though the approach doesn't handle the constraint of
respecting the order provided by the rebar.config files. Onward
with the digression.

We have a set of "packages" where some packages depend on others and
we want to determine an install order such that a package's
dependencies are always installed before the package. The set of
packages and the relation "depends on" form a directed acyclic graph
or DAG. The topological sort of a DAG produces an install order
for such a graph. The ordering is not unique. For example, with a
single package C depending on A and B, valid install orders are
[A, B, C] and [B, A, C].

To set up the problem, we load all of the project dependency
information into a proplist mapping each package to a list of its
dependencies extracted from the package's rebar.config file.

Erlang's standard library provides the digraph and
digraph_utils modules for constructing and operating on directed
graphs. The digraph_utils module includes a topsort/1 function
which we can make use of for our "exercise". The docs say:

Returns a topological ordering of the vertices of the digraph Digraph
if such an ordering exists, false otherwise. For each vertex in the
returned list, there are no out-neighbours that occur earlier in the
list.

To figure out which way to point the edges when building our graph,
consider two packages A and B with A depending on B. We know we want
to end up with an install order of [B, A]. Rereading the topsort/1
docs, we must want an edge B => A. With that, we can build our DAG
and obtain an install order with the topological sort:

load_digraph(Config, Dir) ->
    AllDeps = read_all_deps(Config, Dir),
    G = digraph:new(),
    Nodes = all_nodes(AllDeps),
    [ digraph:add_vertex(G, N) || N <- Nodes ],
    %% If A depends on B, then we add an edge A <= B
    [
      [ digraph:add_edge(G, Dep, Item)
        || Dep <- DepList ]
      || {Item, DepList} <- AllDeps, Item =/= top ],
    digraph_utils:topsort(G).

%% extract a sorted unique list of all deps
all_nodes(AllDeps) ->
    lists:usort(lists:foldl(fun({top, L}, Acc) ->
                                    L ++ Acc;
                               ({K, L}, Acc) ->
                                    [K|L] ++ Acc
                            end, [], AllDeps)).

The digraph module manages graphs using ETS giving it a convenient
API, though one that feels un-erlang-y in its reliance on
side-effects.
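For readers who want to see the text book solution without the digraph module, here is a rough Python sketch of the same idea using Kahn's algorithm. It is not what the plugin does; the function name and the dict-of-lists input are assumptions made for the example.

```python
from collections import deque

def install_order(deps):
    """deps maps each package to the list of packages it depends on.

    Returns a list in which every package comes after its dependencies,
    or None if the dependencies contain a cycle.
    """
    nodes = sorted(set(deps) | {d for ds in deps.values() for d in ds})
    dependents = {n: [] for n in nodes}   # edge dep -> package
    remaining = {n: 0 for n in nodes}     # number of unmet dependencies
    for pkg, ds in deps.items():
        for dep in set(ds):
            dependents[dep].append(pkg)
            remaining[pkg] += 1

    ready = deque(n for n in nodes if remaining[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for pkg in sorted(dependents[n]):
            remaining[pkg] -= 1
            if remaining[pkg] == 0:
                ready.append(pkg)
    return order if len(order) == len(nodes) else None

print(install_order({"C": ["A", "B"]}))
```

As with topsort/1, the result is one valid order among possibly many; ties are broken alphabetically here only to keep the output stable.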

The above gives an install order, but doesn't take into account the
relative order of deps as specified in the rebar.config files. The
solution implemented in the plugin is a bit less fancy, recursing over
the deps and maintaining the desired ordering. The only tricky bit
being that shared deps are ignored until the end and the entire
linearized list is de-duped which required a . Here's the code:

Have you ever run into a bug that, no matter how careful you are trying to
reproduce it, it only happens sometimes? And then, you think you've got it, and
finally solved it - and tested a couple of times without any manifestation. How
do you know that you have tested enough? Are you sure you were not "lucky" in
your tests?

In this article we will see how to answer those questions and the math
behind it without going into too much detail. This is a pragmatic guide.

The Bug

The following program is supposed to generate two random 8-bit integers and print
them on stdout:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns -1 if error, other number if ok. */
int get_random_chars(char *r1, char *r2)
{
    int f = open("/dev/urandom", O_RDONLY);

    if (f < 0)
        return -1;
    if (read(f, r1, sizeof(*r1)) < 0)
        return -1;
    if (read(f, r2, sizeof(*r2)) < 0)
        return -1;
    close(f);

    return *r1 & *r2;
}

int main(void)
{
    char r1;
    char r2;
    int ret;

    ret = get_random_chars(&r1, &r2);

    if (ret < 0)
        fprintf(stderr, "error");
    else
        printf("%d %d\n", r1, r2);

    return ret < 0;
}

On my architecture (Linux on IA-32) it has a bug that makes it print "error"
instead of the numbers sometimes.

The Model

Every time we run the program, the bug can either show up or not. It has a
non-deterministic behaviour that requires statistical analysis.

We will model a single program run as a
Bernoulli trial, with success
defined as "seeing the bug", as that is the event we are interested in. We have
the following parameters when using this model:

\(n\): the number of tests made;

\(k\): the number of times the bug was observed in the \(n\) tests;

\(p\): the unknown (and, most of the time, unknowable) probability of seeing
the bug.

Since each run is a Bernoulli trial, the number of errors \(k\) seen when
running the program \(n\) times follows a
binomial distribution
\(k \sim B(n,p)\). We will use this model to estimate \(p\) and to test the
hypothesis that the bug no longer exists, after fixing the bug in whichever
way we can.

By using this model we are implicitly assuming that all our tests are performed
independently and identically. In other words: if the bug happens more often in
one environment, we either test always in that environment or never; if the bug
gets more and more frequent the longer the computer is running, we reset the
computer after each trial. If we don't do that, we are effectively estimating
the value of \(p\) with trials from different experiments, while in truth each
experiment has its own \(p\). We will find a single value anyway, but it has no
meaning and can lead us to wrong conclusions.

Physical analogy

Another way of thinking about the model and the strategy is by creating a
physical analogy with a box that has an unknown number of green and red balls:

Bernoulli trial: taking a single ball out of the box and looking at its
color - if it is red, we have observed the bug, otherwise we haven't. We then
put the ball back in the box.

\(n\): the total number of trials we have performed.

\(k\): the total number of red balls seen.

\(p\): the total number of red balls in the box divided by the total number of
balls (red and green) in the box.

Some things become clearer when we think about this analogy:

If we open the box and count the balls, we can know \(p\), in contrast with
our original problem.

Without opening the box, we can estimate \(p\) by repeating the trial. As
\(n\) increases, our estimate for \(p\) improves. Mathematically:
\[p = \lim_{n\to\infty}\frac{k}{n}\]

Performing the trials in different conditions is like taking balls out of
several different boxes. The results tell us nothing about any single box.
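The box analogy is easy to simulate. The sketch below is an illustration only: the true \(p\) of 0.25, the seed and the trial counts are arbitrary choices, and each draw is made with replacement, as in the analogy.

```python
import random

def estimate_p(true_p, n, rng):
    """Draw n balls with replacement and return the observed ratio k/n."""
    k = sum(1 for _ in range(n) if rng.random() < true_p)
    return k / n

rng = random.Random(42)  # fixed seed so the run is repeatable
estimates = {n: estimate_p(0.25, n, rng) for n in (3, 30, 300, 30000)}
for n, estimate in estimates.items():
    print(n, estimate)
```

With only a handful of draws the ratio can land far from the real value; with tens of thousands it settles close to it.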

Estimating \(p\)

Before we try fixing anything, we have to know more about the bug, starting with
the probability \(p\) of reproducing it. We can estimate this probability by
dividing the number of times we see the bug \(k\) by the number of times we
tested for it \(n\). Let's try that with our sample bug:

We know from the source code that \(p=25\%\), but let's pretend that we don't, as
will be the case with practically every non-deterministic bug. We tested 3
times, so \(k=1, n=3 \Rightarrow p \approx 33\%\), right? It would be better if we
tested more, but how much more, and exactly what would be better?

\(p\) precision

Let's go back to our box analogy: imagine that there are 4 balls in the box, one
red and three green. That means that \(p = 1/4\). What are the possible results
when we test three times?

| Red balls | Green balls | \(p\) estimate |
|-----------|-------------|----------------|
| 0         | 3           | 0%             |
| 1         | 2           | 33%            |
| 2         | 1           | 66%            |
| 3         | 0           | 100%           |

The less we test, the coarser our estimate is. Roughly, the resolution of the
\(p\) estimate is \(1/n\) - in this case, 33%. That is both the step between the
values we can find for \(p\) and the smallest non-zero value we can find.

Testing more improves the precision of our estimate.

\(p\) likelihood

Let's now approach the problem from another angle: if \(p = 1/4\), what are the
odds of seeing one error in four tests? Let's name the 4 balls as 0-red,
1-green, 2-green and 3-green:

The table above has all the possible results for getting 4 balls out of the
box. That's \(4^4=256\) rows, generated by this python script.
The same script counts the number of red balls in each row, and outputs the
following table:

| k | rows | %      |
|---|------|--------|
| 0 | 81   | 31.64% |
| 1 | 108  | 42.19% |
| 2 | 54   | 21.09% |
| 3 | 12   | 4.69%  |
| 4 | 1    | 0.39%  |

That means that, for \(p=1/4\), we see 1 red ball and 3 green balls only 42% of
the time when getting out 4 balls.
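The counting is easy to reproduce. The following is a small stand-in sketch, not the script mentioned above; the ball names follow the text and the function name is made up.

```python
from collections import Counter
from itertools import product

def count_red_draws(balls, draws):
    """Enumerate every ordered way of drawing with replacement and
    count the rows by the number of red balls they contain."""
    rows = list(product(balls, repeat=draws))
    counts = Counter(
        sum(1 for ball in row if ball.endswith("red")) for row in rows
    )
    return len(rows), counts

balls = ["0-red", "1-green", "2-green", "3-green"]
total, counts = count_red_draws(balls, 4)
for k in range(5):
    print(k, counts[k], "%.2f%%" % (100 * counts[k] / total))
```

Passing a three-ball box (one red, two green) to the same function gives the rows for \(p = 1/3\).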

What if \(p = 1/3\) - one red ball and two green balls? We would get the
following table:

| k | rows | %      |
|---|------|--------|
| 0 | 16   | 19.75% |
| 1 | 32   | 39.51% |
| 2 | 24   | 29.63% |
| 3 | 8    | 9.88%  |
| 4 | 1    | 1.23%  |

What about \(p = 1/2\)?

| k | rows | %      |
|---|------|--------|
| 0 | 1    | 6.25%  |
| 1 | 4    | 25.00% |
| 2 | 6    | 37.50% |
| 3 | 4    | 25.00% |
| 4 | 1    | 6.25%  |

So, let's assume that you've seen the bug once in 4 trials. What is the value of
\(p\)? You know that can happen 42% of the time if \(p=1/4\), but you also know
it can happen 39% of the time if \(p=1/3\), and 25% of the time if \(p=1/2\).
Which one is it?
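The three percentages can also be computed directly from the binomial probability mass function instead of by enumerating rows. A minimal sketch (the helper name is an assumption; math.comb needs Python 3.8 or later):

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of seeing exactly k bugs in n independent runs."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

likelihoods = {
    label: binomial_pmf(1, 4, p)
    for label, p in (("1/4", 1 / 4), ("1/3", 1 / 3), ("1/2", 1 / 2))
}
for label, value in likelihoods.items():
    print("p = %s: %.2f%%" % (label, 100 * value))
```

The same data (one bug in four runs) is compatible with all three values of \(p\), only with different likelihoods.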

The graph below shows the discrete likelihood of getting 1 red and 3 green
balls for every whole-percent value of \(p\):

The fact is that, given the data, the estimate for \(p\)
follows a beta distribution
\(Beta(k+1, n-k+1) = Beta(2, 4)\) (1).
The graph below shows the probability density of \(p\):

The R script used to generate the first plot is here, the
one used for the second plot is here.
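For those without R at hand, the density itself is easy to evaluate. This is a rough Python sketch, not the script linked above; it evaluates \(Beta(2, 4)\) on a grid and checks two properties, namely that the curve integrates to about 1 and that it peaks at the observed ratio \(k/n\).

```python
from math import gamma

def beta_pdf(p, a, b):
    """Density of Beta(a, b) at p, for 0 <= p <= 1 and a, b >= 1."""
    norm = gamma(a + b) / (gamma(a) * gamma(b))
    return norm * p ** (a - 1) * (1 - p) ** (b - 1)

k, n = 1, 4
a, b = k + 1, n - k + 1            # Beta(2, 4)
grid = [i / 1000 for i in range(1001)]
density = [beta_pdf(p, a, b) for p in grid]

mode = grid[density.index(max(density))]   # where the curve peaks
area = sum(density) / 1000                 # crude numerical integral
print(mode, round(area, 3))
```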

Increasing \(n\), narrowing down the interval

What happens when we test more? We obviously increase our precision, as it is at
most \(1/n\), as we said before - there is no way to estimate that \(p=1/3\) when we
only test twice. But there is also another effect: the distribution for \(p\)
gets taller and narrower around the observed ratio \(k/n\):

Investigation framework

So, which value will we use for \(p\)?

The smaller the value of \(p\), the more we have to test to reach a given
confidence in the bug solution.

We must, then, choose the probability of error that we want to tolerate, and
take the smallest value of \(p\) that we can.
A usual value for the probability of error is 5% (2.5% on each side).

That means that we take the value of \(p\) that leaves 2.5% of the area of the
density curve out on the left side. Let's call this value
\(p_{min}\).

That way, if the observed \(k/n\) remains somewhat constant,
\(p_{min}\) will rise, converging to the "real" \(p\) value.

As \(p_{min}\) rises, the amount of testing we have to do after fixing the
bug decreases.

By using this framework we have direct, visual and tangible incentives to test
more. We can objectively measure the potential contribution of each test.

In order to calculate \(p_{min}\) with the mentioned properties, we have
to solve the following equation:

\(\alpha\) here is twice the error we want to tolerate: 5% for an error of 2.5%.

That's not a trivial equation to solve for \(p_{min}\). Fortunately, that's
the formula for the confidence interval of the binomial distribution, and there
are many online calculators for it.
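If you would rather compute it yourself, a lower bound can be found numerically. The sketch below assumes the exact (Clopper-Pearson) form of the binomial confidence interval, which may differ slightly from the calculation used for the graphs in this article: it looks for the \(p\) at which seeing \(k\) or more bugs in \(n\) runs has probability \(\alpha/2\), using plain bisection. The function names are made up, and the counts in the example (6 bugs in 20 runs) are invented for illustration.

```python
from math import comb

def upper_tail(k, n, p):
    """P(X >= k) for X ~ B(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def p_min(k, n, alpha=0.05):
    """Lower end of the exact two-sided (1 - alpha) interval for p."""
    if k == 0:
        return 0.0  # never saw the bug: no positive lower bound
    lo, hi = 0.0, 1.0
    for _ in range(60):              # bisection; upper_tail grows with p
        mid = (lo + hi) / 2
        if upper_tail(k, n, mid) < alpha / 2:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Invented counts: the bug showed up 6 times in 20 runs.
print(round(p_min(6, 20), 4))
```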

So, you have tested a lot and calculated \(p_{min}\). The next step is fixing
the bug.

After fixing the bug, you will want to test again, in order to
confirm that the bug is fixed. How much testing is enough testing?

Let's say that \(t\) is the number of times we test the bug after it is fixed.
Then, if our fix is not effective and the bug still presents itself with
a probability greater than the \(p_{min}\) that we calculated, the probability
of not seeing the bug after \(t\) tests is:

\[\alpha = (1-p_{min})^t \]

Here, \(\alpha\) is also the probability of making a
type I error,
while \(1 - \alpha\) is the statistical significance of our tests.

We now have two options:

arbitrarily choose a standard statistical significance and test enough
times to assert it;

test as much as we can and report the achieved statistical significance.

Both options are valid. The first one is not always feasible, as the cost of
each trial can be high in time and/or other kinds of resources.

The standard in the industry is \(\alpha = 5\%\), that is, a statistical
significance of 95%; we recommend either that or a smaller \(\alpha\).

This file has the results found after running our program 5000
times. We must never throw out data, but let's pretend that we have tested our
program only 20 times. The observed \(k/n\) ratio and the calculated
\(p_{min}\) evolved as shown in the following graph:

After those 20 tests, our \(p_{min}\) is about 12%.

Suppose that we fix the bug and test it again. The following graph shows the
statistical significance corresponding to the number of tests we do:

In words: we have to test 24 times after fixing the bug to reach 95% statistical
significance, and 35 to reach 99%.

Now, what happens if we test more before fixing the bug?

Testing 5000 times

Let's now use all the results and assume that we tested 5000 times before fixing
the bug. The graph below shows \(k/n\) and \(p_{min}\):

After those 5000 tests, our \(p_{min}\) is about 23% - much closer
to the real \(p\).

The following graph shows the statistical significance corresponding to the
number of tests we do after fixing the bug:

We can see in that graph that after about 11 tests we reach 95%, and after about
16 we get to 99%. As we have tested more before fixing the bug, we found a
higher \(p_{min}\), and that allowed us to test less after fixing the
bug.

Optimal testing

We have seen that we decrease \(t\) as we increase \(n\), as that can
potentially increase our lower estimate for \(p\). Of course, that value can
decrease as we test, but that means that we "got lucky" in the first trials and
we are getting to know the bug better - the estimate is approaching the real
value in a non-deterministic way, after all.

But, how much should we test before fixing the bug? Which value is an ideal
value for \(n\)?

To define an optimal value for \(n\), we will minimize the sum \(n+t\). This
objective gives us the benefit of minimizing the total amount of testing without
compromising our guarantees. Minimizing the testing can be fundamental if each
test costs significant time and/or resources.

The graph below shows us the evolution of the value of \(t\) and \(t+n\) using
the data we generated for our bug:

We can see clearly that there are some low values of \(n\) and \(t\) that give
us the guarantees we need. Those values are \(n = 15\) and \(t = 24\), which
gives us \(t+n = 39\).

While you can use this technique to minimize the total number of tests performed
(even more so when testing is expensive), testing more is always a good thing,
as it always improves our guarantee, be it in \(n\) by providing us with a
better \(p\) or in \(t\) by increasing the statistical significance of the
conclusion that the bug is fixed. So, before fixing the bug, test until you see
the bug at least once, and then at least the amount specified by this
technique - but also test more if you can, there is no upper bound, especially
after fixing the bug. You can then report a higher confidence in the solution.

Conclusions

When a programmer finds a bug that behaves in a non-deterministic way, he
knows he should test enough to know more about the bug, and then even more
after fixing it. In this article we have presented a framework that provides
criteria to define numerically how much testing is "enough" and "even more." The
same technique also provides a method to objectively measure the guarantee that
the amount of testing performed provides, when it is not possible to test
"enough."

We have also provided a real example (even though the bug itself is artificial)
where the framework is applied.

Are you using R for data manipulation for later use with other programs, i.e., a workflow something like this:

read data sets from a disk,

modify the data, and

write it back to a disk.

All fine, but if the data set is really big, then you will soon stumble on memory issues. If the data processing is simple and you can read the data in chunks, say line by line, then the following might be useful:

## Create connection
con <- file(description=file, open="r")

## Hopefully you know the number of lines from some other source, or count them
com <- paste("wc -l ", file, " | awk '{ print $1 }'", sep="")
n <- as.integer(system(command=com, intern=TRUE))

## Loop over a file connection
for(i in 1:n) {
    tmp <- scan(file=con, nlines=1, quiet=TRUE)
    ## do something on a line of data
}

## Close the connection when done
close(con)

Additive genetic covariance between individuals is one of the key concepts in (quantitative) genetics. When predicting additive genetic values for pedigree members, we need the inverse of the so-called numerator relationship matrix (NRM), or simply A. Matrix A has off-diagonal entries equal to the numerator of Wright's relationship coefficient and diagonal elements equal to 1 + inbreeding coefficient. I have blogged before about setting up such an inverse in R using a routine from the ASReml-R program or importing the inverse from the CFC program. However, this is not the only way to "skin this cat" in R. I am aware of the following attempts to provide this feature in R for various things (the list is probably incomplete and I would be grateful if you pointed me to other implementations):

pedigree R package has functions makeA() and makeAinv() with obvious meanings; there is also calcG() if you have a lot of marker data instead of pedigree information; there are also some other very handy functions: calcInbreeding(), orderPed(), trimPed(), etc.

pedigreemm R package does not have a direct implementation to get the A inverse, but has all the needed ingredients, which makes the package even more interesting

MCMCglmm R package has function inverseA(), which works with pedigree or phylo objects; there are also handy functions such as prunePed(), rbv(), sm2asreml(), etc.

kinship and kinship2 R packages have function kinship() to set up the kinship matrix, which is equal to half of A; there are also nice functions for plotting pedigrees etc. (see also here)

see also a series of R scripts for relationship matrices

As I described before, the interesting thing is that setting up the inverse of A is easier and cheaper than setting up A and inverting it. This is very important for large applications. This is an old result that rests on the following matrix theory. We can decompose a symmetric positive definite matrix as A = LU = LL' (Cholesky decomposition) or as A = LDU = LDL' (generalized Cholesky decomposition), where L (U) is lower (upper) triangular and D is a diagonal matrix. Note that L and U in the previous two equations are not the same thing (L from the Cholesky decomposition is not equal to L from the generalized Cholesky decomposition)! Sorry for the sloppy notation. To confuse you even more, note that Henderson usually wrote A = TDT'. We can even write A = LSSU, where the diagonal of S is equal to the square root of the diagonal of D. This gets us back to A = LU = LL' as LSSU = LSSL' = LSS'L' = LS(LS)' = LL' (beware of the sloppy notation: the L in the last term is the Cholesky factor LS)! The inverse rule says that inv(A) = inv(LDU) = inv(U) inv(D) inv(L) = inv(L)' inv(D) inv(L) = inv(L)' inv(S) inv(S) inv(L). I thank Martin Maechler for pointing out the last (obvious) bit to me. In Henderson's notation this would be inv(A) = inv(T)' inv(D) inv(T) = inv(T)' inv(S) inv(S) inv(T). Uf ... The important bit is that with the NRM (aka A), inv(L) has a nice simple structure: it shows the directed graph of additive genetic values in the pedigree, while inv(D) tells us about the precision (inverse variance) of additive genetic values given the additive genetic values of the parents, and therefore depends on the knowledge of parents and their inbreeding (the more inbred the parents are, the less variation we can expect in their progeny). Both inv(L) and inv(D) are easier to set up.
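To make the algebra concrete, here is a minimal base R sketch on a hand-made 3 x 3 relationship matrix (two unrelated, non-inbred parents and their offspring). The matrix and the object names (Tmat, S, D, TInv, DInv) are chosen for illustration only; for real pedigrees one would use the packages listed above.

```r
## Small NRM: unrelated, non-inbred parents (1, 2) and their offspring (3)
A <- matrix(c(1.0, 0.0, 0.5,
              0.0, 1.0, 0.5,
              0.5, 0.5, 1.0), nrow = 3, byrow = TRUE)

## Cholesky: chol() returns the upper triangular factor U, so A = U'U = LL'
L <- t(chol(A))

## Generalized Cholesky A = TDT': T is unit lower triangular, D is diagonal
S <- diag(diag(L))        # square roots of the D diagonal
Tmat <- L %*% solve(S)    # named Tmat because T is shorthand for TRUE in R
D <- S %*% S

## Inverse rule: inv(A) = inv(T)' inv(D) inv(T)
TInv <- solve(Tmat)
DInv <- solve(D)
AInv <- t(TInv) %*% DInv %*% TInv

all.equal(Tmat %*% D %*% t(Tmat), A)   # the decomposition reproduces A
all.equal(AInv, solve(A))              # and the inverse rule reproduces inv(A)
TInv                                   # -0.5 in the offspring row under each parent
diag(DInv)                             # precision of each additive genetic value
```

In this small case inv(T) simply links the offspring to its two parents, and the diagonal of inv(D) is 1 for the founders and 2 for the offspring.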

Packages MCMCglmm and pedigree give us inv(A) directly (we can also get inv(D) in MCMCglmm), but pedigreemm enables us to play around with the above matrix algebra and graph theory. First we need a small example pedigree. Below is an example with 10 members; there is also some inbreeding, and individuals have both, one, or no parents known. It is hard to see the inbreeding directly from the table, but we will improve that later (see also here).

The pedigreemm package uses Matrix classes in order to store only what we need to store, e.g., matrix U is triangular (t in "dtCMatrix") and matrix A is symmetric (s in "dsCMatrix"). To show the generalized Cholesky decomposition A = LDU (or, using Henderson's notation, A = TDT') we use gchol() from the bdsmatrix R package. Matrix T shows the "flow" of genes in the pedigree.

Now the A inverse part (inv(A) = inv(T)' inv(D) inv(T) = inv(T)' inv(S) inv(S) inv(T) using Henderson's notation). The nice thing is that the pedigreemm authors provided functions to get inv(T) and D.
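To see why inv(T) and D are cheap to set up, the following base R sketch builds both by hand for a small hypothetical pedigree and checks the result against A computed with the tabular method. It does not use the pedigreemm functions; the pedigree, the coding of unknown parents as 0, and all object names are assumptions made for illustration. The simple rule used for D is valid only when no parent is inbred, which holds here (individual 5 is inbred, but its parents are not).

```r
## Hypothetical pedigree, parents listed before offspring; 0 = unknown parent
sire <- c(0, 0, 1, 1, 3)
dam  <- c(0, 0, 2, 0, 2)
n <- length(sire)

## inv(T): identity with -0.5 linking each individual to its known parents
TInv <- diag(n)
for (i in seq_len(n)) {
  if (sire[i] > 0) TInv[i, sire[i]] <- -0.5
  if (dam[i]  > 0) TInv[i, dam[i]]  <- -0.5
}

## D diagonal, assuming non-inbred parents:
## 1 (no parent known), 0.75 (one known), 0.5 (both known)
nKnown <- (sire > 0) + (dam > 0)
DInv <- diag(1 / (1 - 0.25 * nKnown))

## inv(A) = inv(T)' inv(D) inv(T), without ever forming A
AInv <- t(TInv) %*% DInv %*% TInv

## Check against A from the tabular method
A <- matrix(0, n, n)
for (i in seq_len(n)) {
  s <- sire[i]
  d <- dam[i]
  for (j in seq_len(i - 1)) {
    a <- 0
    if (s > 0) a <- a + 0.5 * A[j, s]
    if (d > 0) a <- a + 0.5 * A[j, d]
    A[i, j] <- A[j, i] <- a
  }
  ## diagonal = 1 + inbreeding coefficient
  A[i, i] <- 1 + if (s > 0 && d > 0) 0.5 * A[s, d] else 0
}
all.equal(AInv %*% A, diag(n))
```

Each row of inv(T) has at most three nonzero entries, which is why the inverse stays sparse even for large pedigrees.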

The second method (using crossprod) is preferred as it leads directly to a symmetric matrix (dsCMatrix), which stores only the upper or lower triangle. Make sure you do not use crossprod(TInv %*% sqrt(DInv)), as that is the wrong order of matrices.
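The point about the order of matrices can be checked with a small base R sketch. The TInv and DInv below are written out by hand for two unrelated, non-inbred parents and their offspring (an illustrative assumption); they are dense base matrices, so the result is an ordinary matrix rather than a dsCMatrix.

```r
## inv(T) and inv(D) for parents 1 and 2 and their offspring 3
TInv <- matrix(c( 1.0,  0.0, 0,
                  0.0,  1.0, 0,
                 -0.5, -0.5, 1), nrow = 3, byrow = TRUE)
DInv <- diag(c(1, 1, 2))

## Direct product: inv(T)' inv(D) inv(T)
AInv1 <- t(TInv) %*% DInv %*% TInv

## crossprod(X) computes X'X, so X must be inv(S) inv(T)
AInv2 <- crossprod(sqrt(DInv) %*% TInv)

## Wrong order: this gives inv(S) inv(T)' inv(T) inv(S) instead
wrong <- crossprod(TInv %*% sqrt(DInv))

isTRUE(all.equal(AInv1, AInv2))   # TRUE
isTRUE(all.equal(AInv1, wrong))   # FALSE
```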

As promised we will display (plot) pedigree by use of conversion functions of matrix objects to graph objects using the following code. Two examples are provided using the graph and igraph packages. The former does a very good job on this example, but otherwise igraph seems to have much nicer support for editing etc.

R can also be used as a scripting tool. We just need to add a shebang in the first line of a file (script):

#!/usr/bin/Rscript

and then the R code should follow.

Often we want to pass arguments to such a script, which can be collected in the script by the commandArgs() function. Then we need to parse the arguments and, conditional on them, do something. I came up with a rather general way of parsing these arguments using simply these few lines:
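One possible sketch of such parsing (not necessarily the exact lines referred to above) handles arguments of the form --name=value and bare flags of the form --name. The helper name parseArgs() and the argument convention are assumptions made for illustration; values are kept as character strings and need to be converted by the caller.

```r
## Hypothetical helper: turn "--name=value" and "--flag" into a named list
parseArgs <- function(args) {
  out <- list()
  for (arg in args) {
    ## skip anything that does not look like an option
    if (!grepl("^--", arg)) next
    kv <- strsplit(sub("^--", "", arg), "=", fixed = TRUE)[[1]]
    ## bare flags become TRUE; values containing "=" are glued back together
    out[[kv[1]]] <- if (length(kv) > 1) paste(kv[-1], collapse = "=") else TRUE
  }
  out
}

## In a script: collect only the arguments that follow the script name
args <- parseArgs(commandArgs(trailingOnly = TRUE))

## Example with a fixed argument vector
example <- parseArgs(c("--n=10", "--verbose", "input.txt"))
as.integer(example$n) + 1   # convert values before using them
```

With the shebang in place, such a script could be called as ./script.R --n=10 --verbose, where script.R is a placeholder file name.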

It is some work, but I find it pretty neat and have been using it for quite a while now. I do wonder what others have come up with for this task. I hope I did not miss some very general solution.