Previous versions (as known to CRANberries) which should be available via the Archive link are:
2019-02-15 0.2-7
2018-07-19 0.2-6
2018-05-22 0.2-5
2017-10-30 0.2-4
2017-08-02 0.2-3
2017-02-03 0.2-2
2016-11-24 0.2-1
2016-09-14 0.2-0
2016-08-03 0.1-6
Previous versions (as known to CRANberries) which should be available via the Archive link are:
2019-02-15 0.2-7
2018-07-19 0.2-6
2018-05-22 0.2-5
2017-10-30 0.2-4
2017-08-02 0.2-3
2017-02-03 0.2-2
2016-11-24 0.2-1
2016-09-14 0.2-0
2016-08-03 0.1-6
Previous versions (as known to CRANberries) which should be available via the Archive link are:
2019-08-20 2.0.2
2019-05-05 2.0.1
2019-04-12 2.0.0
2019-03-08 1.0.0
Previous versions (as known to CRANberries) which should be available via the Archive link are:
2016-04-23 0.1-4
2016-04-22 0.1-3
2016-03-09 0.1-2
2016-01-15 0.1-1
Previous versions (as known to CRANberries) which should be available via the Archive link are:
2018-11-18 0.1
Another pinp package release! pinp allows for snazzier one or two column Markdown-based pdf vignettes, and is now used by a few packages. A screenshot of the package vignette can be seen below. Additional screenshots are at the pinp page.
This release comes exactly one week (i.e. the minimal time to not earn a NOTE
) after the hot-fix release 0.0.8 which addressed breakage on CRAN tickled by changed in TeX Live. After updating the PNAS style LaTeX macros, and avoiding the issue with an (older) custom copy of titlesec, we now have the real fix, thanks to the eagle-eyed attention of Javier Bezos. The error, as so often, was simple and ours: we had left a stray \makeatother
in pinp.cls
where it may have been in hiding for a while. A very big Thank You! to Javier for spotting it, to Norbert for all his help and to James for double-checking on PNAS.
The good news in all of this is that the package is now in better shape than ever. The newer PNAS style works really well, and I went over a few of our extensions (such as papersize
support for a4 as well as letter), direct on/off off a Draft watermark
, a custom subtitle and more—and they all work consistently. So happy vignette or paper writing!
The NEWS entry for this release follows.
Changes in pinp version 0.0.9 (2019-09-15)
The processing error first addressed in release 0.0.8 is now fixed by removing one stray command; many thanks to Javier Bezos.
The hotfix of also installing titlesec.sty has been reverted.
Processing of the 'papersize' and 'watermark' options was updated.
Courtesy of CRANberries, there is a comparison to the previous release. More information is on the tint page. For questions or comments use the issue tracker off the GitHub repo.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
A new package of mine arrived on CRAN yesterday, having been uploaded a few days prior on the weekend. It extends the most excellent (and very minimal / zero depends) unit testing package tinytest by Mark van der Loo with the very clever and well-done diffobj package by Brodie Gaslam. Mark also tweeted about it.
The package was written to address a fairly specific need. In teaching STAT 430 at Illinois, I am relying on the powerful PrairieLearn system (developed there) to provides tests, quizzes or homework. Alton and I have put together an autograder for R (which is work in progress, more on that maybe another day), and that uses this package to provides colorized differences between supplied and expected answers in case of an incorrect answer.
Now, the aspect of providing colorized diffs when tests do not evalute to TRUE
is both simple and general enough. As our approach works rather well, I decided to offer the package on CRAN as well. The small screenshot gives a simple idea, the README.md contains a larger screenshoot.
The initial NEWS entries follow below.
Changes in ttdo version 0.0.3 (2019-09-08)
- Added a simple demo to support initial CRAN upload.
Changes in ttdo version 0.0.2 (2019-08-31)
- Updated defaults for format and mode to use the same options used by diffobj along with fallbacks.
Changes in ttdo version 0.0.1 (2019-08-26)
- Initial version, with thanks to both Mark and Brodie.
Please use the GitHub repo and its issues for any questions.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
A new release of our pinp package is now on CRAN. pinp allows for snazzier one or two column Markdown-based pdf vignettes, and is now used by a few packages. A screenshot of the package vignette can be seen below. Additional screenshots are at the pinp page.
This release was spurned by one of those "CRAN package xyz"
emails I received yesterday: processing of pinp
-using vignettes was breaking at CRAN under the newest TeX Live release present on Debian testing as well as recent Fedora. The rticles package (which uses the PNAS style directly) apparently has a similar issue with PNAS.
Kurt was a usual extremely helpful in debugging, and we narrowed this down to an interaction with the newer versions of titlesec latex package. So for now we did two things: upgrade our code reusing the PNAS class to their newest verson of the PNAS class (as suggested by Norbert whom I also roped in), but also copying in an older version of titlesec.sty
(plus a support file). In the meantime, we are also looking into titlesec directly as Javier offered help—all this was a really decent example of open source firing on all cylinders. It is refreshing.
Because of the move to a newer PNAS version (which seems to clearly help with the occassionally odd formatting of floating blocks near the document end) I may have trampled on earlier extension pull requests. I will reach out to the authors of the PRs to work towards a better process with cleaner diffs, a process I should probably have set up earlier.
The NEWS entry for this release follows.
Changes in pinp version 0.0.8 (2019-09-08)
Two erroraneous 'Provides' were removed from the pinp class.
The upquote package is now use to use actual (non-fancy) quotes in verbatim mode (Dirk fixing #75)
The underlying PNAS style was updated to the most recent v1.44 version of 2018-05-06 to avoid issues with newer TeXLive (Dirk in #79 fixing #77 and #78)
The new PNAS code brings some changes eg watermark is longer an option but typesetting paragraphs seems greatly improved. We may have stomped on an existing behavior, if see please file an issue.
However, it also conflicts with the current texlive version of titlesec so for now we copy
titlesec.sty
(and a support file) in using a prior version, just like we do for pinp.cls and jss.bst.
Courtesy of CRANberries, there is a comparison to the previous release. More information is on the tint page. For questions or comments use the issue tracker off the GitHub repo.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
A new RcppArmadillo release based on a new Armadillo upstream release arrived on CRAN, and will get to Debian shortly. It brings continued improvements for sparse matrices and a few other things; see below for more details. I also appear to have skipped blogging about the preceding 0.9.600.4.0 release (which was actually extra-rigorous with an unprecedented number of reverse-depends runs) so I included its changes (with very nice sparse matrix improvements) as well.
Armadillo is a powerful and expressive C++ template library for linear algebra aiming towards a good balance between speed and ease of use with a syntax deliberately close to a Matlab. RcppArmadillo integrates this library with the R environment and language–and is widely used by (currently) 656 other packages on CRAN.
Changes in RcppArmadillo version 0.9.700.2.0 (2019-09-01)
Upgraded to Armadillo release 9.700.2 (Gangster Democracy)
faster handling of cubes by
vectorise()
faster faster handling of sparse matrices by
nonzeros()
faster row-wise
index_min()
andindex_max()
expanded
join_rows()
andjoin_cols()
to handle joining up to 4 matricesexpanded
.save()
and.load()
to allow storing sparse matrices in CSV formatadded
randperm()
to generate a vector with random permutation of a sequence of integersExpanded the list of known good
gcc
andclang
versions inconfigure.ac
Changes in RcppArmadillo version 0.9.600.4.0 (2019-07-14)
Upgraded to Armadillo release 9.600.4 (Napa Invasion)
faster handling of sparse submatrices
faster handling of sparse diagonal views
faster handling of sparse matrices by
symmatu()
andsymmatl()
faster handling of sparse matrices by
join_cols()
expanded
clamp()
to handle sparse matricesadded
.clean()
to replace elements below a threshold with zeros
Courtesy of CRANberries, there is a diffstat report relative to previous release. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
by Virgilio Gómez-Rubio, Paula Moraga, John Molitor, Barry Rowlingson at August 22, 2019 12:00 AM
by Tarak Kharrat, Georgi N. Boshnakov, Ian McHale, Rose Baker at August 21, 2019 12:00 AM
Causal mediation analysis has demonstrated the advantage of mechanism investigation. In conditions with causally ordered mediators, path-specific effects (PSEs) are introduced for specifying the effect subject to a certain combination of mediators. However, most PSEs are unidentifiable. To address this, an alternative approach termed interventional analogue of PSE (iPSE), is widely applied to effect decomposition. Previous studies that have considered multiple mediators have mainly focused on two-mediator cases due to the complexity of the mediation formula. This study proposes a generalized interventional approach for the settings, with the arbitrary number of ordered multiple mediators to study the causal parameter identification as well as statistical estimation. It provides a general definition of iPSEs with a recursive formula, assumptions for nonparametric identification, a regression-based method, and a g-computation algorithm to estimate all iPSEs. We demonstrate that each iPSE reduces to the result of linear structural equation modeling subject to linear or log-linear models. This approach is applied to a Taiwanese cohort study for exploring the mechanism by which hepatitis C virus infection affects mortality through hepatitis B virus infection, liver function, and hepatocellular carcinoma. Software based on a g-computation algorithm allows users to easily apply this method for data analysis subject to various model choices according to the substantive knowledge for each variable. All methods and software proposed in this study contribute to comprehensively decompose a causal effect confirmed by data science and help disentangling causal mechanisms when the natural pathways are complicated.
One of the major goals in large-scale genomic studies is to identify genes with a prognostic impact on time-to-event outcomes which provide insight into the disease's process. With rapid developments in high-throughput genomic technologies in the past two decades, the scientific community is able to monitor the expression levels of tens of thousands of genes and proteins resulting in enormous data sets where the number of genomic features is far greater than the number of subjects. Methods based on univariate Cox regression are often used to select genomic features related to survival outcome; however, the Cox model assumes proportional hazards (PH), which is unlikely to hold for each feature. When applied to genomic features exhibiting some form of non-proportional hazards (NPH), these methods could lead to an under- or over-estimation of the effects. We propose a broad array of marginal screening techniques that aid in feature ranking and selection by accommodating various forms of NPH. First, we develop an approach based on Kullback-Leibler information divergence and the Yang-Prentice model that includes methods for the PH and proportional odds (PO) models as special cases. Next, we propose R^{2} indices for the PH and PO models that can be interpreted in terms of explained variation. Lastly, we propose a generalized pseudo-R^{2} measure that includes PH, PO, crossing hazards and crossing odds models as special cases and can be interpreted as the percentage of separability between subjects experiencing the event and not experiencing the event according to feature expression. We evaluate the performance of our measures using extensive simulation studies and publicly available data sets in cancer genomics. We demonstrate that the proposed methods successfully address the issue of NPH in genomic feature selection and outperform existing methods. The proposed information divergence, R^{2} and pseudo-R^{2} measures were implemented in R (www.R-project.org) and code is available upon request.
The past two decades have witnessed significant advances in high-throughput ``omics" technologies such as genomics, proteomics, metabolomics, transcriptomics and radiomics. These technologies have enabled simultaneous measurement of the expression levels of tens of thousands of features from individual patient samples and have generated enormous amounts of data that require analysis and interpretation. One specific area of interest has been in studying the relationship between these features and patient outcomes, such as overall and recurrence-free survival, with the goal of developing a predictive ``omics" profile. Large-scale studies often suffer from the presence of a large fraction of censored observations and potential time-varying effects of features, and methods for handling them have been lacking. In this paper, we propose supervised methods for feature selection and survival prediction that simultaneously deal with both issues. Our approach utilizes continuum power regression (CPR) - a framework that includes a variety of regression methods - in conjunction with the parametric or semi-parametric accelerated failure time (AFT) model. Both CPR and AFT fall within the linear models framework and, unlike black-box models, the proposed prognostic index has a simple yet useful interpretation. We demonstrate the utility of our methods using simulated and publicly available cancer genomics data.
Standard sample size calculation formulas for Stepped Wedge Cluster Randomized Trials (SW-CRTs) assume that cluster sizes are equal. When cluster sizes vary substantially, ignoring this variation may lead to an under-powered study. We investigate the relative efficiency of a SW-CRT with varying cluster sizes to equal cluster sizes, and derive variance estimators for the intervention effect that account for this variation under the assumption of a mixed effects model; a commonly-used approach for analyzing data from cluster randomized trials. When cluster sizes vary, the power of a SW-CRT depends on the order in which clusters receive the intervention, which is determined through randomization. We first derive a variance formula that corresponds to any particular realization of the randomized sequence and propose efficient algorithms to identify upper and lower bounds of the power. We then obtain an ``expected'' power based on a first-order approximation to the variance formula, where the expectation is taken with respect to all possible randomization sequences. Finally, we provide a variance formula for more general settings where only the mean and coefficient of variation of cluster sizes, instead of exact cluster sizes, are known in the design stage. We evaluate our methods through simulations and illustrate that the power of a SW-CRT decreases as the variation in cluster sizes increases, and the impact is largest when the number of clusters is small.
Most data scientists wished that all data lived neatly managed in some DB. However, in reality, Excel files are ubiquitous and often a common way to disseminate results or data within many companies. Every now and then I found myself in the situation where I wanted to protect Excel sheets against users accidentally changing them. A few months later, however, I found that I sometimes had forgotten the password I used. The “good” thing is that protecting Excel sheets by password is far from safe and access can be recovered quite easily. The following works for .xlsx files only (tested with Excel 2016 files), not the older .xls files.
Before implementing the steps in R, I will outline how to remove the password protection “by hand”. The R way is simply the automation of these steps. The first thing one needs to understand is that a .xlsx file is just a collection of folders and files in a zip container. If you unzip a .xlsx file (e.g. using 7-Zip) you get the following folder structure (sorry, German UI):
In the folder ./xl/worksheets we find one XML file for each Excel sheet. The sheet’s password protection is encoded directly in the sheet. While there used to be the plain password text in the XML in former versions, now, we find the hashed password (see part marked in yellow below). In order to get rid of the password protection, we simply can remove the whole sheetProtection node from the XML. We can do that in any text editor and save the file.
As the last step, we need to recreate the .xlsx file by creating a zip folder that contains our modified XML file (German UI again).
Finally, we change the file extension from .zip back to .xlsx and voilá, we get an Excel file without password protected sheets. Programming the steps outlined above in R is quite straightforward. The steps are commented in the GitHub gist below.
Besides outstanding support for dense matrices, the Armadillo library also provides a great way to manipulate sparse matrices in C++. However, the performance characteristics of dealing with sparse matrices may be surprising if one is only familiar with dense matrices. This is a collection of observations on getting best performance with sparse matrices in Armadillo.
All the timings in this article were generated using Armadillo version 8.500. This version adds a number of substantial optimisations for sparse matrix operations, in some cases speeding things up by as much as two orders of magnitude.
Perhaps the most important thing to note is that the efficiency of sparse algorithms can depend strongly on the level of sparsity in the data. If your matrices and vectors are very sparse (most elements equal to zero), you will often see better performance even if the nominal sizes of those matrices remain the same. This isn’t specific to C++ or Armadillo; it applies to sparse algorithms in general, including the code used in the Matrix package for R. By contrast, algorithms for working with dense matrices usually aren’t affected by sparsity.
Similarly, the pattern of accesssing elements, whether by rows or by columns, can have a significant impact on performance. This is due to caching, which modern CPUs use to speed up memory access: accessing elements that are already in the cache is much faster than retrieving them from main memory. If one iterates or loops over the elements of a matrix in Armadillo, one should try to iterate over columns first, then rows, to maximise the benefits of caching. This applies to both dense and sparse matrices. (Technically, at least for dense matrices, whether to iterate over rows or columns first depends on how the data is stored in memory. Both R and Armadillo store matrices in column-major order, meaning that elements in the same column are contiguous in memory. Sparse matrices are more complex but the advice to iterate by columns is basically the same; see below.)
We start with a simple concrete example: multiplying two matrices together. In R, this can be done
using the %*%
operator which (via the Matrix package) is able to handle any combination of sparse
and dense inputs. However, let us assume we want to do the multiplication in Armadillo: for example
if the inputs are from other C++ functions, or if we want more precise control of the output.
In Armadillo, the *
operator multiplies two matrices together, and this also works for any
combination of sparse and dense inputs. However, the speed of the operation can vary tremendously,
depending on which of those inputs is sparse. To see this, let us define a few simple functions:
The outputs of these functions are all the same, but they take different types of inputs: either two sparse matrices, or a sparse and a dense matrix, or a dense and a sparse matrix (the order matters). Let us call them on some randomly generated data:
user system elapsed 0.230 0.091 0.322
user system elapsed 0.407 0.036 0.442
user system elapsed 1.081 0.100 1.181
user system elapsed 0.826 0.087 0.913
[1] TRUE
While all the times are within an order of magnitude of each other, multiplying a dense and a sparse matrix takes about twice as long as multiplying two sparse matrices together, and multiplying a sparse and dense matrix takes about three times as long. (The sparse-dense multiply is actually one area where Armadillo 8.500 makes massive gains over previous versions. This operation used to take much longer due to using an unoptimised multiplication algorithm.)
Let us see if we can help the performance of the mixed-type functions by creating a temporary sparse copy of the dense input. This forces Armadillo to use the sparse-sparse version of matrix multiply, which as seen above is much more efficient. For example, here is the result of tweaking the dense-sparse multiply. Creating the sparse copy does take some extra time and memory, but not enough to affect the result.
user system elapsed 0.401 0.047 0.448
[1] TRUE
This shows that mixing sparse and dense inputs can have significant effects on the efficiency of your code. To avoid unexpected slowdowns, consider sticking to either sparse or dense classes for all your objects. If one decides to mix them, it is worth remembering to test and profile the code.
Consider another simple computation: multiply the elements of a matrix by their row number, then sum them up. (The multiply by row number is to make it not completely trivial.) That is, we want to obtain:
Armadillo lets us extract individual rows and columns from a matrix, using the .row()
and .col()
member functions. We can use row extraction to do this computation:
[1] 1248361320
user system elapsed 1.708 0.000 1.707
For a large matrix, this takes a not-insignificant amount of time, even on a fast machine. To speed it up, we will make use of the fact that Armadillo uses the compressed sparse column (CSC) format for storing sparse matrices. This means that the data for a matrix is stored as three vectors: the nonzero elements; the row indices of these elements (ordered by column); and a vector of offsets for the first row index in each column. Since the vector of row indices is ordered by column, and we have the starting offsets for each column, it turns out that extracting a column slice is actually very fast. We only need to find the offset for that column, and then pull out the elements and row indices up to the next column offset. On the other hand, extracting a row is much more work; we have to search through the indices to find those matching the desired row.
We can put this knowledge to use on our problem. Row slicing a matrix is the same as transposing it and then slicing by columns, so let us define a new function to return a column slice. (Transposing the matrix takes only a tiny fraction of a second.)
[1] 1248361320
user system elapsed 0.766 0.000 0.766
The time taken has come down by quite a substantial margin. This reflects the ease of obtaining column slices for sparse matrices.
We can take the previous example further. Each time R calls a C++ function that takes a sparse matrix as input, it makes a copy of the data. Similarly, when the C++ function returns, its sparse outputs are copied into R objects. When the function itself is very simple—as it is here—all this copying and memory shuffling can be a significant proportion of the time taken.
Rather than calling sapply
in R to iterate over rows, let us do the same entirely in C++:
[1] 1248361320
user system elapsed 0.933 0.000 0.935
This is again a large improvement. But what if we do the same with column slicing?
[1] 1248361320
user system elapsed 0.005 0.000 0.006
Now the time is less than a tenth of a second, which is faster than the original code by roughly
three orders of magnitude. The moral to the story is: rather than constantly switching between C++
and R, you should try to stay in one environment for as long as possible. If your code involves a
loop with a C++ call inside (including functional maps like lapply
and friends), consider writing
the loop entirely in C++ and combine the results into a single object to return to R.
(It should be noted that this interface tax is less onerous for built-in Rcpp classes such as
NumericVector
or NumericMatrix
, which do not require making copies of the data. Sparse data
types are different, and in particular Armadillo’s sparse classes do not provide constructors that
can directly use auxiliary memory.)
Rather than taking explicit slices of the data, let us try using good old-fashioned loops over the matrix elements. This is easily coded up in Armadillo, and it turns out to be quite efficient, relatively speaking. It is not as fast as using column slicing, but much better than row slicing.
[1] 1248361320
user system elapsed 0.176 0.000 0.176
However, we can still do better. In Armadillo, the iterators for sparse matrix classes iterate only over the nonzero elements. Let us compare this to our naive double loop:
[1] 1248361320
user system elapsed 0.002 0.000 0.002
This is the best time achieved so far, to the extent that system.time
might have difficulty
capturing it. The timings are now so low that we should use the microbenchmark package to get
accurate measurements:
Unit: milliseconds expr min lq mean median uq max neval col 4.78921 4.88444 5.05229 4.99184 5.18450 5.50579 20 elem 172.84830 177.20431 179.87007 179.06447 182.08075 188.11256 20 iter 1.02268 1.05447 1.12611 1.12627 1.16482 1.30800 20
Thus, using iterators represents a greater than three-order-of-magnitude speedup over the original row-slicing code.
Rcpp
has an elegant mechanism of exception handling
whereby C++
exceptions are automatically translated to errors in R
. For most
projects, the Rcpp::stop
wrapper (in conjunction with the BEGIN_RCPP
and
END_RCPP
macros automatically inserted by
RcppAttributes
)
is sufficient and easy to use, providing an Rcpp
equivalent of base::stop
.
By default, it captures the call stack and attaches it to the exception
in R
, giving informative error messages:
Error in add1(1:5, 1:3): x and y are not the same length!
This matches the default behavior of base::stop()
which captures the call info.
For complex calling patterns (e.g., creating an argument list and calling the
Rcpp
function with do.call
), the resulting error messages are less helpful:
Error in (function (x, y) : x and y are not the same length!
If the internal error were being generated in R
code, we might choose to use
the call.=FALSE
argument to base::stop
to suppress the unhelpful (function (x, y)
part of the error message, but we don’t (immediately) have a corresponding option
in Rcpp
. In this gallery post, we show how to suppress the call-stack capture
of Rcpp::stop
to give cleaner error messages.
The key functionality was added to Rcpp
by Jim Hester in
Rcpp Pull Request #663.
To generate an R
-level exception without a call stack, we pass an optional
false
flag to Rcpp::exception
. For example,
Error: x and y are not the same length!
This can’t capture the R
level call stack, but it is at least cleaner than
the error message from the previous example.
Note that here, as elsewhere in C++
, we need to handle exceptions using a
try/catch
structure, but we do not add it explicitly because
RcppAttributes
automatically handles this for us.
Similar to Rcpp::stop
, Rcpp
also provides a warning
function to generate
R
level warnings. It has the same call-stack capture behavior as stop
.
For the direct call case:
Warning in add4(1:5, 1:3): x and y are not the same length!
[1] 2 4 6 4 5
For the indirect call case:
Warning in (function (x, y) : x and y are not the same length!
[1] 2 4 6 4 5
If we want to suppress the call stack info in this warning, we have to drop
down to the C
-level R
API. In particular, we use the Rf_warningcall
function,
which takes the call as the first argument. By passing a NULL
, we suppress the call:
Warning: x and y are not the same length!
[1] 2 2 3 4 5
The above methods work, but they are not as clean as their Rcpp::stop
and
Rcpp::warning
counterparts. We can take advantage of C++11
to provide
similar functionality for our call-free versions.
Basing our implementation on the C++11
implementation
of Rcpp::stop
and Rcpp::warning
we can define our own
stopNoCall
and warningNoCall
Warning: x and y are not the same length!
[1] 2 4 6 4 5
Error: x and y are not the same length!
Note that we used C++11
variadic templates here – if we wanted to do something
similar in C++98,
we could use essentially the same pattern, but would need to
implement each case individually.
The RcppArrayFire package provides an interface from R to and from the ArrayFire library, an open source library that can make use of GPUs and other hardware accelerators via CUDA or OpenCL.
The official R bindings expose ArrayFire data structures as S4 objects in R, which would require a large amount of code to support all the methods defined in ArrayFire’s C/C++ API. RcppArrayFire instead, which is derived from RcppFire by Kazuki Fukui, follows the lead of packages like RcppArmadillo or RcppEigen to provide seamless communication between R and ArrayFire at the C++ level.
Please note that RcppArrayFire is developed and tested on Linux systems. There is preliminary support for Mac OS X.
In order to use RcppArrayFire you will need the ArrayFire library and header files. While ArrayFire has been packaged for Debian, I currently prefer using upstream’s binary installer or building from source.
RcppArrayFire is not on CRAN, but you can install the current version via drat:
If you have installed ArrayFire in a non-standard directory, you have to use the
configure argument --with-arrayfire
, e.g.:
Let’s look at the classical example of calculating via simulation. The basic idea is to generate a large number of random points within the unit square. An approximation for can then be calculated from the ratio of points within the unit circle to the total number of points. A vectorized implementation in R might look like this:
pi ~= 3.13999
user system elapsed 0.102 0.009 0.111
A simple way to use C++ code in R is to use the inline package or cppFunction()
from Rcpp, which are both possible with RcppArrayFire. An implementation in C++
using ArrayFire might look like this:
pi ~= 3.14279
user system elapsed 0.000 0.001 0.000
Several things are worth noting:
The syntax is almost identical. Besides the need for using types and a
different function name when generating random numbers, the argument f32
to
randu
as well as the float
type catches the eye. These instruct ArrayFire to
use single precision floats, since not all devices support double precision
floating point numbers. If you want and can to use double precision, you have to
specify f64
and double
.
The results are not the same since ArrayFire uses a different random number generator.
The speed-up can be quite impressive. However, the first invocation of a function is often not as fast as expected due to the just-in-time compilation used by ArrayFire. This can be circumvented by using a warm-up call with (normally) fewer computations.
Up to now we have only considered simple types like double
or int
as function
parameters and return values. However, we can also use arrays. Consider the case
of an European put option that was recently handled with R, Rcpp and RcppArmadillo.
The Armadillo based function from this post reads:
This function can be applied to a range of spot prices:
[,1] [1,] 5.52021 [2,] 4.58142 [3,] 3.68485 [4,] 2.85517 [5,] 2.11883 [6,] 1.49793
Porting this code to RcppArrayFire is straight forward:
Compared with the implementations in R, Rcpp and RcppArmadillo the syntax is again almost the same. One exception is that ArrayFire does not contain a function for the cumulative normal distribution function. However, the closely related error function is available.
Since an object of type af::array
can contain different data types,
the templated wrapper class RcppArrayFire::typed_array<>
is used to indicate the
desired data type when converting from R to C++. Again single precision floats are
used with ArrayFire, which leads to differences of the order compared to the
results from R, Rcpp and RcppArmadillo:
[1] 5.52021 4.58143 3.68485 2.85516 2.11883 1.49793
The reason to use hardware accelerators is of course the quest for increased performance. How does ArrayFire fare in this respect? Using the same benchmark as in the R, Rcpp and RcppArmadillo comparison:
test replications elapsed relative 2 AF 100 0.471 1.000 1 Arma 100 5.923 12.575
Here a Nvidia GeForce GT 1030 is used together with ArrayFire’s CUDA backend. With a build-in Intel HD Graphics 520 using the OpenCL backend the ArrayFire solution is about 6 times faster. Even without a high performance GPU the performance boost from using ArrayFire can be quite impressive. However, the results change dramatically, if fewer options are evaluated:
test replications elapsed relative 1 Arma 1000 0.008 1.000 2 AF 1000 0.123 15.375
But is it realistic to process options at once? Probably not in the way used in the benchmark where only the spot price is allowed to vary. However, one can alter the function to process not only arrays of spot prices but also arrays of strikes, risk free rates etc.:
Note that ArrayFire does not recycle elements if arrays with non-matching dimensions are combined. In this particular case this means that all arrays must have the same length. One can ensure that by using a data frame for the values:
s k r y t sigma p 1 87.4192 50 0.01 0.02 1 0.05 7.44673e-29 2 48.7060 50 0.01 0.02 1 0.05 2.09401e+00 3 67.2626 50 0.01 0.02 1 0.05 2.34556e-09 4 72.6573 50 0.01 0.02 1 0.05 6.84457e-14 5 68.0854 50 0.01 0.02 1 0.05 5.26653e-10 6 57.8775 50 0.01 0.02 1 0.05 2.57750e-03
The ArrayFire library provides a convenient way to use hardware accelerators without the need to write low-level OpenCL or CUDA code. The C++ syntax is actually quite similar to properly vectorized R code. The RcppArrayFire package makes this available to useRs. However, one still has to be careful: using hardware accelerators is not a “silver bullet” due to the inherent memory transfer overhead.
In the quest for ever faster code, one generally begins exploring ways to integrate C++ with R using Rcpp. This post provides an example of multiple implementations of a European Put Option pricer. The implementations are done in pure R, pure Rcpp using some Rcpp sugar functions, and then in Rcpp using RcppArmadillo, which exposes the incredibly powerful linear algebra library, Armadillo.
Under the Black-Scholes model The value of a European put option has the closed form solution:
where
and
Armed with the formulas, we can create the pricer using just R.
[1] 5.52021
[1] 5.52021 4.58142 3.68485 2.85517 2.11883 1.49793
Let’s see what we can do with Rcpp. Besides explicitely stating the
types of the variables, not much has to change. We can even use the sugar function,
Rcpp::pnorm()
, to keep the syntax as close to R as possible. Note how we are being
explicit about the symbols we import from the Rcpp
namespace: the basic vector type, and
well the (vectorized) ‘Rcpp Sugar’ calls log()
and pnorm()
calls. Similarly, we use
sqrt()
and exp()
for the calls on an atomic double
variables from the C++ Standard
Library. With these declarations the code itself is essentially identical to the R code
(apart of course from requiring both static types and trailing semicolons).
We can call this from R as well:
[1] 5.52021
[1] 5.52021 4.58142 3.68485 2.85517 2.11883 1.49793
Finally, let’s look at
RcppArmadillo. Armadillo has a
number of object types, including mat
, colvec
, and rowvec
. Here, we just use
colvec
to represent a column vector of prices. By default in Armadillo, *
represents
matrix multiplication, and %
is used for element wise multiplication. We need to make
this change to element wise multiplication in 1 place, but otherwise the changes are just
switching out the types and the sugar functions for Armadillo specific functions.
Note that the arma::normcdf()
function is in the upcoming release of
RcppArmadillo, which is
0.8.400.0.0
at the time of writing and still in CRAN’s incoming. It also requires the
C++11
plugin.
Use from R:
[,1] [1,] 5.52021
[,1] [1,] 5.52021 [2,] 4.58142 [3,] 3.68485 [4,] 2.85517 [5,] 2.11883 [6,] 1.49793
Finally, we can run a speed test to see which comes out on top.
test replications elapsed relative 2 Arma 100 6.409 1.000 3 Rcpp 100 7.917 1.235 1 R 100 9.091 1.418
Interestingly, Armadillo comes out on top here on this (multi-core) machine (as Armadillo uses OpenMP where available in newer versions). But the difference is slender, and there is certainly variation in repeated runs. And the nicest thing about all of this is that it shows off the “embarassment of riches” that we have in the R and C++ ecosystem for multiple ways of solving the same problem.
ext install language-julia
and hit Enter. This will install the Julia extension for Visual Studio Code. After installation, you can load the Julia REPL by pressing the following keys Ctrl+Shift+P (Windows) or Cmd+Shift+P (Mac) and enter julia start repl
, and press Enter. If there is an error, the path may need to be specified properly. To do this, go to Preferences > Settings. Then in the .json user file settings, enter the following:Julia-0.6.0-rc3
(Windows) or Julia-0.6.app
(Mac) with the desired version of your Julia, and the C:/Users/MyName
with your desired path. Further, I use the following setting in my .json file to adjust my Minimap similar to the screenshot above.by Unknown (noreply@blogger.com) at November 07, 2017 04:35 AM
gamma
to .01 in the codes above) the algorithm will converge at 42nd iteration. To support that claim, see the steps of its gradient in the plot below.beta_new
to .1) with $\gamma=.01$, the algorithm converges at 173rd iteration with estimate $\hat{\beta}_{173}=2.249962\approx\frac{9}{4}$ (see the plot below).The evaluation of peritoneal dialysis (PD) programmes requires the use of statistical methods that suit the complexity of such programmes. Multi-state regression models taking competing risks into account are a good example of suitable approaches. In this work, multi-state structured additive regression (STAR) models combined with penalized splines (P-splines) are proposed to evaluate peritoneal dialysis programmes. These models are very flexible since they may consider smooth estimates of baseline transition intensities and the inclusion of time-varying and smooth covariate effects at each transition. A key issue in survival analysis is the quantification of the time-dependent predictive accuracy of a given regression model, which is typically assessed using receiver operating characteristic (ROC)’based methodologies. The main objective of the present study is to adapt the concept of time-dependent ROC curve, and their corresponding area under the curve (AUC), to a multi-state competing risks framework. All statistical methodologies discussed in this work were applied to PD survival data. Using a multi-state competing risks framework, this study explored the effects of major clinical covariates on survival such as age, sex, diabetes and previous renal replacement therapy. Such multi-state model was composed of one transient state (peritonitis) and several absorbing states (death, transfer to haemodialysis and renal transplantation). The application of STAR models combined with time-dependent ROC curves revealed important conclusions not previously reported in the nephrology literature when using standard statistical methodologies. For practical application, all the statistical methods proposed in this article were implemented in R and we wrote and made available a script named as NestedCompRisks.
by Teixeira, L., Cadarso-Suarez, C., Rodrigues, A., Mendonca, D. at September 29, 2016 05:16 AM
Index measures are commonly used in medical research and clinical practice, primarily for quantification of health risks in individual subjects or patients. The utility of an index measure is ultimately contingent on its ability to predict health outcomes. Construction of medical indices has largely been based on heuristic arguments, although the acceptance of a new index typically requires objective validation, preferably with multiple outcomes. In this article, we propose an analytical tool for index development and validation. We use a multivariate single-index model to ascertain the best functional form for risk index construction. Methodologically, the proposed model represents a multivariate extension of the traditional single-index models. Such an extension is important because it assures that the resultant index simultaneously works for multiple outcomes. The model is developed in the general framework of longitudinal data analysis. We use penalized cubic splines to characterize the index components while leaving the other subject characteristics as additive components. The splines are estimated directly by penalizing nonlinear least squares, and we show that the model can be implemented using existing software. To illustrate, we examine the formation of an adiposity index for prediction of systolic and diastolic blood pressure in children. We assess the performance of the method through a simulation study.
The shared frailty model is a popular tool to analyze correlated right-censored time-to-event data. In the shared frailty model, the latent frailty is assumed to be shared by the members of a cluster and is assigned a parametric distribution, typically a gamma distribution due to its conjugacy. In the case of interval-censored time-to-event data, the inclusion of frailties results in complicated intractable likelihoods. Here, we propose a flexible frailty model for analyzing such data by assuming a smooth semi-parametric form for the conditional time-to-event distribution and a parametric or a flexible form for the frailty distribution. The results of a simulation study suggest that the estimation of regression parameters is robust to misspecification of the frailty distribution (even when the frailty distribution is multimodal or skewed). Given sufficiently large sample sizes and number of clusters, the flexible approach produces smooth and accurate posterior estimates for the baseline survival function and for the frailty density, and it can correctly detect and identify unusual frailty density forms. The methodology is illustrated using dental data from the Signal Tandmobiel® study.
To represent the complex structure of intensive longitudinal data of multiple individuals, we propose a hierarchical Bayesian Dynamic Model (BDM). This BDM is a generalized linear hierarchical model where the individual parameters do not necessarily follow a normal distribution. The model parameters can be estimated on the basis of relatively small sample sizes and in the presence of missing time points. We present the BDM and discuss the model identification, convergence and selection. The use of the BDM is illustrated using data from a randomized clinical trial to study the differential effects of three treatments for panic disorder. The data involves the number of panic attacks experienced weekly (73 individuals, 10–52 time points) during treatment. Presuming that the counts are Poisson distributed, the BDM considered involves a linear trend model with an exponential link function. The final model included a moving average parameter and an external variable (duration of symptoms pre-treatment). Our results show that cognitive behavioural therapy is less effective in reducing panic attacks than serotonin selective re-uptake inhibitors or a combination of both. Post hoc analyses revealed that males show a slightly higher number of panic attacks at the onset of treatment than females.
by Krone, T., Albers, C., Timmerman, M. at September 29, 2016 05:16 AM
Most R users will know that data frames are lists. You can easily verify that a data frame is a list by typing
d <- data.frame(id=1:2, name=c("Jon", "Mark")) d
id name 1 1 Jon 2 2 Mark
is.list(d)
[1] TRUE
However, data frames are lists with some special properties. For example, all entries in the list must have the same length (here 2), etc. You can find a nice description of the differences between lists and data frames here. To access the first column of d, we find that it contains a vector (and a factor in case of column name). Note, that [[ ]] is an operator to select a list element. As data frames are lists, they will work here as well.
is.vector(d[[1]])
[1] TRUE
A long time, I was unaware of the fact, that data frames may also contain lists as columns instead of vectors. For example, let’s assume Jon’s children are Mary and James, and Mark’s children are called Greta and Sally. Their names are stored in a list with two elements. We can add them to the data frame like this:
d$children <- list(c("Mary", "James"), c("Greta", "Sally")) d
id name children 1 1 Jon Mary, James 2 2 Mark Greta, Sally
A single data frame entry in column children now contains more than one value. Given that the column is a list, not a vector, we cannot go as usual when modifying an entry of the column. For example, to change Jon’s children, we cannot do
> d[1 , "children"] <- c("Mary", "James", "Thomas") Error in `[<-.data.frame`(`*tmp*`, 1, "children", value = c("Mary", "James", : replacement has 3 rows, data has 1
Taking into account the list structure of the column, we can type the following to change the values in a single cell.
d[1 , "children"][[1]] <- list(c("Mary", "James", "Thomas")) # or also d$children[1] <- list(c("Mary", "James", "Thomas")) d
id name children 1 1 Jon Mary, James, Thomas 2 2 Mark Greta, Sally
You can also create a data frame having a list as a column using the <tt>data.frame</tt> function, but with a little tweak. The list column has to be wrapped inside the function <tt>I</tt>. This will protect it from several conversions taking place in <tt>data.frame</tt> (see <tt>?I</tt> documentation).
d <- data.frame(id = 1:2, name = c("Jon", "Mark"), children = I(list(c("Mary", "James"), c("Greta", "Sally"))) )
This is an interesting feature, which gives me a deeper understanding of what a data frame is. But when exactly would I want to use it? I have not encountered the need to use it very often yet (though of course there may be plenty of situations where it makes sense). But today I had a case where this feature seemed particularly useful.
I had two separate types of information. One stored in a data frame and the other one in a list Referring to the example above, I had
d <- data.frame(id=1:2, name=c("Jon", "Mark")) d
id name 1 1 Jon 2 2 Mark
and
ch <- list(c("Mary", "James"), c("Greta", "Sally")) ch
[[1]] [1] "Mary" "James" [[2]] [1] "Greta" "Sally"
I needed to return an array of JSON objects which look like this.
[ { "id": 1, "name": "Jon", "children": ["Mary", "James"] }, { "id": 2, "name": "Mark", "children": ["Greta", "Sally"] } ]
Working with the superb jsonlite package to convert R to JSON, I could do the following to get the result above.
library(jsonlite) l <- split(d, seq(nrow(d))) # convert data frame rows to list l <- unname(l) # remove list names for (i in seq_along(l)) # add element from ch to list l[[i]] <- c(l[[i]], children=ch[i]) toJSON(l, pretty=T, auto_unbox = T) # convert to JSON
The results are correct, but getting there involved quite a number of tedious steps. These can be avoided by directly placing the list into a column of the data frame. Then jsonlite::toJSON takes care of the rest.
d$children <- list(c("Mary", "James"), c("Greta", "Sally")) toJSON(d, pretty=T, auto_unbox = T)
[ { "id": 1, "name": "Jon", "children": ["Mary", "James"] }, { "id": 2, "name": "Mark", "children": ["Greta", "Sally"] } ]
Nice :) What we do here, is basically creating the same nested list structure as above, only now it is disguised as a data frame. However, this approach is much more convenient.
princomp
as seen below:princomp
consist of a list shown above, we will give description to selected outputs. Others can be found in the documentation of the function by executing ?princomp
. sdev
- standard deviation, the square root of the eigenvalues $\lambda$ of the variance-covariance matrix $\mathbf{\Sigma}$ of the data, dat.mat
;loadings
- eigenvectors $\mathbf{e}$ of the variance-covariance matrix $\mathbf{\Sigma}$ of the data, dat.mat
;scores
- the principal component scores.loadings
which is a matrix of eigenvectors, where the columns corresponds to the eigenvectors of the sequence of principal components, that is if the first principal component is given by $Y_1=\mathbf{u}_1^T\mathbf{X}$, then the estimate of $\mathbf{u}_1$ which is $\mathbf{e}_1$ (eigenvector) is the set of coefficients obtained from the first column of the loadings
. The explained variability of the first principal component is the square of the first standard deviation sdev
, the explained variability of the second principal component is the square of the second standard deviation sdev
, and so on. Now let's interpret the loadings (coefficients) of the first three principal components. Below is the plot of this, PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 |
---|---|---|---|---|---|---|---|---|---|
Table 1: Variability Explained by the First Ten Principal Components for the AVIRIS data. | |||||||||
82.057 | 17.176 | 0.320 | 0.182 | 0.094 | 0.065 | 0.037 | 0.029 | 0.014 | 0.005 |
sdev
) of the variance-covariance matrix $\mathbf{\Sigma}$ of the original variable $\mathbf{X}$, hence the percentage of variability explained by the $j$th PC is equal to its corresponding eigenvalue $\lambda_j$ divided by the overall variability which is the sum of the eigenvalues, $\sum_{j=1}^{p}\lambda_j$, as we see in the following code,Click on the image to zoom in. |
by Unknown (noreply@blogger.com) at December 27, 2015 01:52 AM
k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. (Wikipedia, Ref 1.)We will apply this method to an image, wherein we group the pixels into k different clusters. Below is the image that we are going to use,
Colorful Bird From Wall321 |
by Unknown (noreply@blogger.com) at December 27, 2015 01:52 AM
Here’s a phrase you never want to see in print (in a legal decision, no less) pertaining to your academic research: “The IRB process, however, was improperly engaged by the Dartmouth researcher and ignored completely by the Stanford researchers.”
Whole thing here; it’s a doozy.
Today I was working on a two-part procrustes problem and wanted to find out why my minimization algorithm sometimes does not converge properly or renders unexpected results. The loss function to be minimized is
with denoting the Frobenius norm, is an unknown scalar and an unknown rotation matrix, i.e. . , and are four real valued matrices. The minimum for is easily found by setting the partial derivation of w.r.t equal to zero.
By plugging into the loss function we get a new loss function that only depends on . This is the starting situation.
When trying to find out why the algorithm to minimize did not work as expected, I got stuck. So I decided to conduct a small simulation and generate random rotation matrices to study the relation between the parameter and the value of the loss function . Before looking at the results for the entire two-part procrustes problem from above, let’s visualize the results for the first part of the loss function only, i.e.
Here, has the same minimum as for the whole formula above. For the simulation I used
as input matrices. Generating many random rotation matrices and plotting against the value of the loss function yields the following plot.
This is a well behaved relation, for each scaling parameter the loss is identical. Now let’s look at the full two-part loss function. As input matrices I used
and the following R-code.
# trace function tr <- function(X) sum(diag(X)) # random matrix type 1 rmat_1 <- function(n=3, p=3, min=-1, max=1){ matrix(runif(n*p, min, max), ncol=p) } # random matrix type 2, sparse rmat_2 <- function(p=3) { diag(p)[, sample(1:p, p)] } # generate random rotation matrix Q. Based on Q find # optimal scaling factor c and calculate loss function value # one_sample <- function(n=2, p=2) { Q <- mixAK::rRotationMatrix(n=1, dim=p) %*% # random rotation matrix det(Q) = 1 diag(sample(c(-1,1), p, rep=T)) # additional reflections, so det(Q) in {-1,1} s <- tr( t(Q) %*% t(A1) %*% B1 ) / norm(A1, "F")^2 # scaling factor c rss <- norm(s*A1 %*% Q - B1, "F")^2 + # get residual sum of squares norm(A2 %*% Q - B2, "F")^2 c(s=s, rss=rss) } # find c and rss or many random rotation matrices # set.seed(10) # nice case for 3 x 3 n <- 3 p <- 3 A1 <- round(rmat_1(n, p), 1) B1 <- round(rmat_1(n, p), 1) A2 <- rmat_2(p) B2 <- rmat_2(p) x <- plyr::rdply(40000, one_sample(3,3)) plot(x$s, x$rss, pch=16, cex=.4, xlab="c", ylab="L(Q)", col="#00000010")
This time the result turns out to be very different and … beautiful :)
Here, we do not have a one to one relation between the scaling parameter and the loss function any more. I do not quite know what to make of this yet. But for now I am happy that it has aestethic value. Below you find some more beautiful graphics with different matrices as inputs.
Cheers!
RCall
: Running an embedded R in JuliaRCall
.RCall
is to access datasets from R and R packages, to fit models that are not currently implemented in Julia and to use R graphics, especially the ggplot2 and lattice packages. Unfortunately I am not currently able to start a graphics device from the embedded R but I expect that to be fixed soon.RCall
although it may not mean much if you haven't tried to do this kind of thing. It is written entirely in Julia. There is absolutely no "glue" code written in a compiled language like C or C++. As I said, this may not mean much to you unless you have tried to do something like this, in which case it is astonishing.by Douglas Bates (noreply@blogger.com) at February 24, 2015 11:05 PM
by Unknown (noreply@blogger.com) at January 16, 2015 10:22 PM
by Unknown (noreply@blogger.com) at January 15, 2015 10:16 PM
We teach two software packages, R and SPSS, in Quantitative Methods 101 for psychology freshman at Bremen University (Germany). Sometimes confusion arises, when the software packages produce different results. This may be due to specifics in the implemention of a method or, as in most cases, to different default settings. One of these situations occurs when the QQ-plot is introduced. Below we see two QQ-plots, produced by SPSS and R, respectively. The data used in the plots were generated by:
set.seed(0) x <- sample(0:9, 100, rep=T)
SPSS
R
qqnorm(x, datax=T) # uses Blom's method by default qqline(x, datax=T)
There are some obvious differences:
To get a better understanding of the difference we will build the R and SPSS-flavored QQ-plot from scratch.
In order to calculate theoretical quantiles corresponding to the observed values, we first need to find a way to assign a probability to each value of the original data. A lot of different approaches exist for this purpose (for an overview see e.g. Castillo-Gutiérrez, Lozano-Aguilera, & Estudillo-Martínez, 2012b). They usually build on the ranks of the observed data points to calculate corresponding p-values, i.e. the plotting positions for each point. The qqnorm function uses two formulae for this purpose, depending on the number of observations (Blom’s mfethod, see ?qqnorm; Blom, 1958). With being the rank, for it will use the formula , for the formula to determine the probability value for each observation (see the help files for the functions qqnorm and ppoint). For simplicity reasons, we will only implement the case here.
n <- length(x) # number of observations r <- order(order(x)) # order of values, i.e. ranks without averaged ties p <- (r - 1/2) / n # assign to ranks using Blom's method y <- qnorm(p) # theoretical standard normal quantiles for p values plot(x, y) # plot empirical against theoretical values
Before we take at look at the code, note that our plot is identical to the plot generated by qqnorm above, except that the QQ-line is missing. The main point that makes the difference between R and SPSS is found in the command order(order(x)). The command calculates ranks for the observations using ordinal ranking. This means that all observations get different ranks and no average ranks are calculated for ties, i.e. for observations with equal values. Another approach would be to apply fractional ranking and calculate average values for ties. This is what the function rank does. The following codes shows the difference between the two approaches to assign ranks.
v <- c(1,1,2,3,3) order(order(v)) # ordinal ranking used by R
## [1] 1 2 3 4 5
rank(v) # fractional ranking used by SPSS
## [1] 1.5 1.5 3.0 4.5 4.5
R uses ordinal ranking and SPSS uses fractional ranking by default to assign ranks to values. Thus, the positions do not overlap in R as each ordered observation is assigned a different rank and therefore a different p-value. We will pick up the second approach again later, when we reproduce the SPSS-flavored plot in R.^{1}
The second difference between the plots concerned the scaling of the y-axis and was already clarified above.
The last point to understand is how the QQ-line is drawn in R. Looking at the probs argument of qqline reveals that it uses the 1st and 3rd quartile of the original data and theoretical distribution to determine the reference points for the line. We will draw the line between the quartiles in red and overlay it with the line produced by qqline to see if our code is correct.
plot(x, y) # plot empirical against theoretical values ps <- c(.25, .75) # reference probabilities a <- quantile(x, ps) # empirical quantiles b <- qnorm(ps) # theoretical quantiles lines(a, b, lwd=4, col="red") # our QQ line in red qqline(x, datax=T) # R QQ line
The reason for different lines in R and SPSS is that several approaches to fitting a straight line exist (for an overview see e.g. Castillo-Gutiérrez, Lozano-Aguilera, & Estudillo-Martínez, 2012a). Each approach has different advantages. The method used by R is more robust when we expect values to diverge from normality in the tails, and we are primarily interested in the normality of the middle range of our data. In other words, the method of fitting an adequate QQ-line depends on the purpose of the plot. An explanation of the rationale of the R approach can e.g. be found here.
The default SPSS approach also uses Blom’s method to assign probabilities to ranks (you may choose other methods is SPSS) and differs from the one above in the following aspects:
n <- length(x) # number of observations r <- rank(x) # a) ranks using fractional ranking (averaging ties) p <- (r - 1/2) / n # assign to ranks using Blom's method y <- qnorm(p) # theoretical standard normal quantiles for p values y <- y * sd(x) + mean(x) # b) transform SND quantiles to mean and sd from original data plot(x, y) # plot empirical against theoretical values
Lastly, let us add the line. As the scaling of both axes is the same, the line goes through the origin with a slope of .
abline(0,1) # c) slope 0 through origin
The comparison to the SPSS output shows that they are (visually) identical.
The whole point of this demonstration was to pinpoint and explain the differences between a QQ-plot generated in R and SPSS, so it will no longer be a reason for confusion. Note, however, that SPSS offers a whole range of options to generate the plot. For example, you can select the method to assign probabilities to ranks and decide how to treat ties. The plots above used the default setting (Blom’s method and averaging across ties). Personally I like the SPSS version. That is why I implemented the function qqnorm_spss in the ryouready package, that accompanies the course. The formulae for the different methods to assign probabilities to ranks can be found in Castillo-Gutiérrez et al. (2012b). The implentation is a preliminary version that has not yet been thoroughly tested. You can find the code here. Please report any bugs or suggestions for improvements (which are very welcome) in the github issues section.
library(devtools) install_github("markheckmann/ryouready") # install from github repo library(ryouready) # load package library(ggplot2) qq <- qqnorm_spss(x, method=1, ties.method="average") # Blom's method with averaged ties plot(qq) # generate QQ-plot ggplot(qq) # use ggplot2 to generate QQ-plot
by Unknown (noreply@blogger.com) at October 20, 2014 04:24 PM
Assuming that the “no” vote prevails in the Scottish independence referendum, the next question for the United Kingdom is to consider constitutional reform to implement a quasi-federal system and resolve the West Lothian question once and for all. In some ways, it may also provide an opportunity to resolve the stalled reform of the upper house as well. Here’s the rough outline of a proposal that might work.
Devolve identical powers to England, Northern Ireland, Scotland, and Wales, with the proviso that local self-rule can be suspended if necessary by the federal legislature (by a supermajority).
The existing House of Commons becomes the House of Commons for England, which (along with the Sovereign) shall comprise the English Parliament. This parliament would function much as the existing devolved legislatures in Scotland and Wales; the consociational structure of the Northern Ireland Assembly (requiring double majorities) would not be replicated.
The House of Lords is abolished, and replaced with a directly-elected Senate of the United Kingdom. The Senate will have authority to legislate on the non-devolved powers (in American parlance, “delegated” powers) such as foreign and European Union affairs, trade and commerce, national defense, and on matters involving Crown dependencies and territories, the authority to legislate on devolved matters in the event self-government is suspended in a constituent country, and dilatory powers including a qualified veto (requiring a supermajority) over the legislation proposed by a constituent country’s parliament. The latter power would effectively replace the review powers of the existing House of Lords; it would function much as the Council of Revision in Madison’s original plan for the U.S. Constitution.
As the Senate will have relatively limited powers, it need not be as large as the existing Lords or Commons. To ensure the countries other than England have a meaningful voice, given that nearly 85% of the UK’s population is in England, two-thirds of the seats would be allocated proportionally based on population and one-third allocated equally to the four constituent countries. This would still result in a chamber with a large English majority (around 64.4%) but nonetheless would ensure the other three countries would have meaningful representation as well.
Kevin Drum asks a bunch of questions about soccer:
The first one’s easy: there’s a general rule that no defensive player can be within 10 yards of the spot of a direct free kick. A penalty kick (which is a type of direct free kick) takes place in the 18-yard box, and no players other than the player taking the kick and the goalkeeper are allowed in the box. However, owing to geometry, the 18 yard box and the 10 yard exclusion zone don’t fully coincide, hence the penalty arc. (That’s also why there are two tiny hash-marks on the goal line and side line 10 yards from the corner flag. And why now referees have a can of shaving cream to mark the 10 yards for other free kicks, one of the few MLS innovations that has been a good idea.)
Second one’s also easy: the half and the game cannot end while the ball is out of play.
Third one’s harder. First, keeping time inexactly forestalls the silly premature celebrations that are common in most US sports. You’d never see the Stanford-Cal play happen in a soccer game. Second, it allows some slippage for short delays and doesn’t require exact timekeeping; granted, this was more valuable before instant replays and fourth officials, but most US sports require a lot of administrative record-keeping by ancillary officials. A soccer game can be played with one official (and often is, particularly at the amateur level) without having to change timing rules;* in developing countries in particular this lowers the barriers to entry for the sport (along with the low equipment requirements) without changing the nature of the game appreciably. Perhaps most importantly, if the clock was allowed to stop regularly it would create an excuse for commercial timeouts and advertising breaks, which would interrupt the flow of the game and potentially reduce the advantages of better-conditioned and more skilled athletes. (MLS tried this, along with other exciting American ideas like “no tied games,” and it was as appealing to actual soccer fans as ketchup on filet mignon would be to a foodie, and perhaps more importantly didn’t make any non-soccer fans watch.)
Fourth, the key distinction is usually whether there was an obvious attempt to play the ball; in addition, in the modern game, even some attempts to play the ball are considered inherently dangerous (tackling from behind, many sliding tackles, etc.) and therefore are fouls even if they are successful in getting more ball than human.
* To call offside, you’d also probably need what in my day we called a “linesman.”
Probably the worst-kept non-secret is that the next stage of the institutional evolution of my current employer is to some ill-defined concept of “university status,” which mostly involves the establishment of some to-be-determined master’s degree programs. In the context of the University System of Georgia, it means a small jump from the “state college” prestige tier (a motley collection of schools that largely started out as two-year community colleges and transfer institutions) to the “state university” tier (which is where most of the ex-normal schools hang out these days). What is yet to be determined is how that transition will affect the broader institution that will be the University of Middle Georgia.* People on high are said to be working on these things; in any event, here are my assorted random thoughts on what might be reasonable things to pursue:
* There is no official name-in-waiting, but given that our former interim president seemed to believe he could will this name into existence by repeating it enough I’ll stick with it. The straw poll of faculty trivia night suggests that it’s the least bad option available, which inevitably means the regents will choose something else instead (if the last name change is anything to go by).
I've been putting off sharing this idea because I've heard the rumors about what happens to folks who aren't security experts when they post about security on the internet. If this blog is replaced with cat photos and rainbows, you'll know what happened.
It's 2014 and chances are you have accounts on websites that are not properly handling user passwords. I did no research to produce the following list of ways passwords are mishandled in decreasing order of frequency:
SHA1(salt + plain-password)
.We know that sites should be generating secure random salts and using an established slow hashing algorithm (bcrypt, scrypt, or PBKDF2). Why are sites not doing this?
While security issues deserve a top spot on any site's priority list, new features often trump addressing legacy security concerns. The immediacy of the risk is hard to quantify and it's easy to fall prey to a "nothing bad has happened yet, why should we change now" attitude. It's easy for other bugs, features, or performance issues to win out when measured by immediate impact. Fixing security or other "legacy" issues is the Right Thing To Do and often you will see no measurable benefit from the investment. It's like having insurance. You don't need it until you do.
Specific to the improper storage of user password data is the issue of the impact to a site imposed by upgrading. There are two common approaches to upgrading password storage. You can switch cold turkey to the improved algorithms and force password resets on all of your users. Alternatively, you can migrate incrementally such that new users and any user who changes their password gets the increased security.
The cold turkey approach is not a great user experience and sites might choose to delay an upgrade to avoid admitting to a weak security implementation and disrupting their site by forcing password resets.
The incremental approach is more appealing, but the security benefit is drastically diminished for any site with a substantial set of existing users.
Given the above migration choices, perhaps it's (slightly) less surprising that businesses choose to prioritize other work ahead of fixing poorly stored user password data.
What if you could upgrade a site so that both new and existing users immediately benefited from the increased security, but without the disruption of password resets? It turns out that you can and it isn't very hard.
Consider a user table with columns:
userid
salt
hashed_pass
Where the hashed_pass
column is computed using a weak fast
algorithm, for example SHA1(salt + plain_pass)
.
The core of the idea is to apply a proper algorithm on top of the data
we already have. I'll use bcrypt
to make the discussion
concrete. Add columns to the user table as follows:
userid
salt
hashed_pass
hash_type
salt2
Process the existing user table by computing bcrypt(salt2 +
hashed_pass)
and storing the result in the hashed_pass
column
(overwriting the less secure value); save the new salt value to
salt2
and set hash_type
to bycrpt+sha1
.
To verify a user where hash_type
is bcrypt+sha1
, compute
bcrypt(salt2 + SHA1(salt + plain_pass))
and compare to the
hashed_pass
value. Note that bcrypt implementations encode the salt
as a prefix of the hashed value so you could avoid the salt2
column,
but it makes the idea easier to explain to have it there.
You can take this approach further and have any user that logs in (as
well as new users) upgrade to a "clean" bcrypt only algorithm since
you can now support different verification algorithms using
hash_type
. With the proper application code changes in place, the
upgrade can be done live.
This scheme will also work for sites storing non-salted password hashes as well as those storing plain text passwords (THE HORROR).
Perhaps this approach makes implementing a password storage security upgrade more palatable and more likely to be prioritized. And if there's a horrible flaw in this approach, maybe you'll let me know without turning this blog into a tangle of cat photos and rainbows.
If you use rebar to generate an OTP release project and want to
have reproducible builds, you need the rebar_lock_deps_plugin
plugin. The plugin provides a lock-deps
command that will generate a
rebar.config.lock
file containing the complete flattened set of
project dependencies each pegged to a git SHA. The lock file acts
similarly to Bundler's Gemfile.lock
file and allows for reproducible
builds (*).
Without lock-deps
you might rely on the discipline of using a tag
for all of your application's deps. This is insufficient if any dep
depends on something not specified as a tag. It can also be a problem
if a third party dep doesn't provide a tag. Generating a
rebar.config.lock
file solves these issues. Moreover, using
lock-deps
can simplify the work of putting together a release
consisting of many of your own repos. If you treat the master branch
as shippable, then rather than tagging each subproject and updating
rebar.config
throughout your project's dependency chain, you can
run get-deps
(without the lock file), compile
, and re-lock at the
latest versions throughout your project repositories.
The reproducibility of builds when using lock-deps
depends on the
SHAs captured in rebar.config.lock
. The plugin works by scanning the
cloned repos in your project's deps
directory and extracting the
current commit SHA. This works great until a repository's history is
rewritten with a force push. If you really want reproducible builds,
you need to not nuke your SHAs and you'll need to fork all third party
repos to ensure that someone else doesn't screw you over in this
fashion either. If you make a habit of only depending on third party
repos using a tag, assume that upstream maintainers are not completely
bat shit crazy, and don't force push your master branch, then you'll
probably be fine.
Install the plugin in your project by adding the following to your
rebar.config
file:
%% Plugin dependency
{deps, [
{rebar_lock_deps_plugin, ".*",
{git, "git://github.com/seth/rebar_lock_deps_plugin.git", {branch, "master"}}}
]}.
%% Plugin usage
{plugins, [rebar_lock_deps_plugin]}.
To test it out do:
rebar get-deps
# the plugin has to be compiled so you can use it
rebar compile
rebar lock-deps
If you'd like to take a look at a project that uses the plugin, take a look at CHEF's erchef project.
If you are building an OTP release project using rebar generate
then
you can use rebar_lock_deps_plugin
to enhance your build experience
in three easy steps.
Use rebar bump-rel-version version=$BUMP
to automate the process
of editing rel/reltool.config
to update the release version. The
argument $BUMP
can be major
, minor
, or patch
(default) to
increment the specified part of a semver X.Y.Z
version. If
$BUMP
is any other value, it is used as the new version
verbatim. Note that this function rewrites rel/reltool.config
using ~p
. I check-in the reformatted version and maintain the
formatting when editing. This way, the general case of a version
bump via bump-rel-version
results in a minimal diff.
Autogenerate a change summary commit message for all project
deps. Assuming you've generated a new lock file and bumped the
release version, use rebar commit-release
to commit the changes
to rebar.config.lock
and rel/reltool.config
with a commit
message that summarizes the changes made to each dependency between
the previously locked version and the newly locked version. You can
get a preview of the commit message via rebar log-changed-deps
.
Finally, create an annotated tag for your new release with rebar
tag-release
which will read the current version from
rel/reltool.config
and create an annotated tag named with the
version.
Up to version 2.0.1 of rebar_lock_deps_plugin
, the dependencies in
the generated lock file were ordered alphabetically. This was a
side-effect of using filelib:wildcard/1
to list the dependencies in
the top-level deps
directory. In most cases, the order of the full
dependency set does not matter. However, if some of the code in your
project uses parse transforms, then it will be important for the parse
transform to be compiled and on the code path before attempting to
compile code that uses the parse transform.
This issue was recently discovered by a colleague who ran into build
issues using the lock file for a project that had recently integrated
lager for logging. He came up with the idea of maintaining the
order of deps as they appear in the various rebar.config
files along
with a prototype patch proving out the idea. As of
rebar_lock_deps_plugin
3.0.0, the lock-deps
command will (mostly)
maintain the relative order of dependencies as found in the
rebar.config
files.
The "mostly" is that when a dep is shared across two subprojects, it
will appear in the expected order for the first subproject (based on
the ordering of the two subprojects). The deps for the second
subproject will not be in strict rebar.config
order, but the
resulting order should address any compile-time dependencies and be
relatively stable (only changing when project deps alter their deps
with larger impact when shared deps are introduced or removed).
There are times, as a programmer, when a real-world problem looks like a text book exercise (or an interview whiteboard question). Just the other day at work we had to design some manhole covers, but I digress.
Fixing the order of the dependencies in the generated lock file is
(nearly) the same as finding an install order for a set of projects
with inter-dependencies. I had some fun coding up the text book
solution even though the approach doesn't handle the constraint of
respecting the order provided by the rebar.config
files. Onward
with the digression.
We have a set of "packages" where some packages depend on others and we want to determine an install order such that a package's dependencies are always installed before the package. The set of packages and the relation "depends on" form a directed acyclic graph or DAG. The topological sort of a DAG produces an install order for such a graph. The ordering is not unique. For example, with a single package C depending on A and B, valid install orders are [A, B, C] and [B, A, C].
To setup the problem, we load all of the project dependency
information into a proplist mapping each package to a list of its
dependencies extracted from the package's rebar.config
file.
read_all_deps(Config, Dir) ->
TopDeps = rebar_config:get(Config, deps, []),
Acc = [{top, dep_names(TopDeps)}],
DepDirs = filelib:wildcard(filename:join(Dir, "*")),
Acc ++ [
{filename:basename(D), dep_names(extract_deps(D))}
|| D <- DepDirs ].
Erlang's standard library provides the digraph and
digraph_utils modules for constructing and operating on directed
graphs. The digraph_utils
module includes a topsort/1
function
which we can make use of for our "exercise". The docs say:
Returns a topological ordering of the vertices of the digraph Digraph if such an ordering exists, false otherwise. For each vertex in the returned list, there are no out-neighbours that occur earlier in the list.
To figure out which way to point the edges when building our graph,
consider two packages A and B with A depending on B. We know we want
to end up with an install order of [B, A]. Rereading the topsort/1
docs, we must want an edge B => A
. With that, we can build our DAG
and obtain an install order with the topological sort:
load_digraph(Config, Dir) ->
AllDeps = read_all_deps(Config, Dir),
G = digraph:new(),
Nodes = all_nodes(AllDeps),
[ digraph:add_vertex(G, N) || N <- Nodes ],
%% If A depends on B, then we add an edge A <= B
[
[ digraph:add_edge(G, Dep, Item)
|| Dep <- DepList ]
|| {Item, DepList} <- AllDeps, Item =/= top ],
digraph_utils:topsort(G).
%% extract a sorted unique list of all deps
all_nodes(AllDeps) ->
lists:usort(lists:foldl(fun({top, L}, Acc) ->
L ++ Acc;
({K, L}, Acc) ->
[K|L] ++ Acc
end, [], AllDeps)).
The digraph
module manages graphs using ETS giving it a convenient
API, though one that feels un-erlang-y in its reliance on
side-effects.
The above gives an install order, but doesn't take into account the
relative order of deps as specified in the rebar.config
files. The
solution implemented in the plugin is a bit less fancy, recursing over
the deps and maintaining the desired ordering. The only tricky bit
being that shared deps are ignored until the end and the entire
linearized list is de-duped which required a . Here's the code:
order_deps(AllDeps) ->
Top = proplists:get_value(top, AllDeps),
order_deps(lists:reverse(Top), AllDeps, []).
order_deps([], _AllDeps, Acc) ->
de_dup(Acc);
order_deps([Item|Rest], AllDeps, Acc) ->
ItemDeps = proplists:get_value(Item, AllDeps),
order_deps(lists:reverse(ItemDeps) ++ Rest, AllDeps, [Item | Acc]).
de_dup(AccIn) ->
WithIndex = lists:zip(AccIn, lists:seq(1, length(AccIn))),
UWithIndex = lists:usort(fun({A, _}, {B, _}) ->
A =< B
end, WithIndex),
Ans0 = lists:sort(fun({_, I1}, {_, I2}) ->
I1 =< I2
end, UWithIndex),
[ V || {V, _} <- Ans0 ].
The great thing about posting to your blog is, you don't have to have a proper conclusion if you don't want to.
Have you ever run into a bug that, no matter how careful you are trying to reproduce it, it only happens sometimes? And then, you think you've got it, and finally solved it - and tested a couple of times without any manifestation. How do you know that you have tested enough? Are you sure you were not "lucky" in your tests?
In this article we will see how to answer those questions and the math behind it without going into too much detail. This is a pragmatic guide.
The following program is supposed to generate two random 8-bit integer and print them on stdout:
#include <stdio.h> #include <fcntl.h> #include <unistd.h> /* Returns -1 if error, other number if ok. */ int get_random_chars(char *r1, char*r2) { int f = open("/dev/urandom", O_RDONLY); if (f < 0) return -1; if (read(f, r1, sizeof(*r1)) < 0) return -1; if (read(f, r2, sizeof(*r2)) < 0) return -1; close(f); return *r1 & *r2; } int main(void) { char r1; char r2; int ret; ret = get_random_chars(&r1, &r2); if (ret < 0) fprintf(stderr, "error"); else printf("%d %d\n", r1, r2); return ret < 0; }
On my architecture (Linux on IA-32) it has a bug that makes it print "error" instead of the numbers sometimes.
Every time we run the program, the bug can either show up or not. It has a non-deterministic behaviour that requires statistical analysis.
We will model a single program run as a Bernoulli trial, with success defined as "seeing the bug", as that is the event we are interested in. We have the following parameters when using this model:
As a Bernoulli trial, the number of errors \(k\) of running the program \(n\) times follows a binomial distribution \(k \sim B(n,p)\). We will use this model to estimate \(p\) and to confirm the hypotheses that the bug no longer exists, after fixing the bug in whichever way we can.
By using this model we are implicitly assuming that all our tests are performed independently and identically. In order words: if the bug happens more ofter in one environment, we either test always in that environment or never; if the bug gets more and more frequent the longer the computer is running, we reset the computer after each trial. If we don't do that, we are effectively estimating the value of \(p\) with trials from different experiments, while in truth each experiment has its own \(p\). We will find a single value anyway, but it has no meaning and can lead us to wrong conclusions.
Another way of thinking about the model and the strategy is by creating a physical analogy with a box that has an unknown number of green and red balls:
Some things become clearer when we think about this analogy:
Before we try fixing anything, we have to know more about the bug, starting by the probability \(p\) of reproducing it. We can estimate this probability by dividing the number of times we see the bug \(k\) by the number of times we tested for it \(n\). Let's try that with our sample bug:
$ ./hasbug 67 -68 $ ./hasbug 79 -101 $ ./hasbug error
We know from the source code that \(p=25%\), but let's pretend that we don't, as will be the case with practically every non-deterministic bug. We tested 3 times, so \(k=1, n=3 \Rightarrow p \sim 33%\), right? It would be better if we tested more, but how much more, and exactly what would be better?
Let's go back to our box analogy: imagine that there are 4 balls in the box, one red and three green. That means that \(p = 1/4\). What are the possible results when we test three times?
Red balls | Green balls | \(p\) estimate |
---|---|---|
0 | 3 | 0% |
1 | 2 | 33% |
2 | 1 | 66% |
3 | 0 | 100% |
The less we test, the smaller our precision is. Roughly, \(p\) precision will be at most \(1/n\) - in this case, 33%. That's the step of values we can find for \(p\), and the minimal value for it.
Testing more improves the precision of our estimate.
Let's now approach the problem from another angle: if \(p = 1/4\), what are the odds of seeing one error in four tests? Let's name the 4 balls as 0-red, 1-green, 2-green and 3-green:
The table above has all the possible results for getting 4 balls out of the box. That's \(4^4=256\) rows, generated by this python script. The same script counts the number of red balls in each row, and outputs the following table:
k | rows | % |
---|---|---|
0 | 81 | 31.64% |
1 | 108 | 42.19% |
2 | 54 | 21.09% |
3 | 12 | 4.69% |
4 | 1 | 0.39% |
That means that, for \(p=1/4\), we see 1 red ball and 3 green balls only 42% of the time when getting out 4 balls.
What if \(p = 1/3\) - one red ball and two green balls? We would get the following table:
k | rows | % |
---|---|---|
0 | 16 | 19.75% |
1 | 32 | 39.51% |
2 | 24 | 29.63% |
3 | 8 | 9.88% |
4 | 1 | 1.23% |
What about \(p = 1/2\)?
k | rows | % |
---|---|---|
0 | 1 | 6.25% |
1 | 4 | 25.00% |
2 | 6 | 37.50% |
3 | 4 | 25.00% |
4 | 1 | 6.25% |
So, let's assume that you've seen the bug once in 4 trials. What is the value of \(p\)? You know that can happen 42% of the time if \(p=1/4\), but you also know it can happen 39% of the time if \(p=1/3\), and 25% of the time if \(p=1/2\). Which one is it?
The graph bellow shows the discrete likelihood for all \(p\) percentual values for getting 1 red and 3 green balls:
The fact is that, given the data, the estimate for \(p\) follows a beta distribution \(Beta(k+1, n-k+1) = Beta(2, 4)\) (1) The graph below shows the probability distribution density of \(p\):
The R script used to generate the first plot is here, the one used for the second plot is here.
What happens when we test more? We obviously increase our precision, as it is at most \(1/n\), as we said before - there is no way to estimate that \(p=1/3\) when we only test twice. But there is also another effect: the distribution for \(p\) gets taller and narrower around the observed ratio \(k/n\):
So, which value will we use for \(p\)?
By using this framework we have direct, visual and tangible incentives to test more. We can objectively measure the potential contribution of each test.
In order to calculate \(p_{min}\) with the mentioned properties, we have to solve the following equation:
\[\sum_{k=0}^{k}{n\choose{k}}p_{min} ^k(1-p_{min})^{n-k}=\frac{\alpha}{2} \]
\(alpha\) here is twice the error we want to tolerate: 5% for an error of 2.5%.
That's not a trivial equation to solve for \(p_{min}\). Fortunately, that's the formula for the confidence interval of the binomial distribution, and there are a lot of sites that can calculate it:
So, you have tested a lot and calculated \(p_{min}\). The next step is fixing the bug.
After fixing the bug, you will want to test again, in order to confirm that the bug is fixed. How much testing is enough testing?
Let's say that \(t\) is the number of times we test the bug after it is fixed. Then, if our fix is not effective and the bug still presents itself with a probability greater than the \(p_{min}\) that we calculated, the probability of not seeing the bug after \(t\) tests is:
\[\alpha = (1-p_{min})^t \]
Here, \(\alpha\) is also the probability of making a type I error, while \(1 - \alpha\) is the statistical significance of our tests.
We now have two options:
Both options are valid. The first one is not always feasible, as the cost of each trial can be high in time and/or other kind of resources.
The standard statistical significance in the industry is 5%, we recommend either that or less.
Formally, this is very similar to a statistical hypothesis testing.
This file has the results found after running our program 5000 times. We must never throw out data, but let's pretend that we have tested our program only 20 times. The observed \(k/n\) ration and the calculated \(p_{min}\) evolved as shown in the following graph:
After those 20 tests, our \(p_{min}\) is about 12%.
Suppose that we fix the bug and test it again. The following graph shows the statistical significance corresponding to the number of tests we do:
In words: we have to test 24 times after fixing the bug to reach 95% statistical significance, and 35 to reach 99%.
Now, what happens if we test more before fixing the bug?
Let's now use all the results and assume that we tested 5000 times before fixing the bug. The graph bellow shows \(k/n\) and \(p_{min}\):
After those 5000 tests, our \(p_{min}\) is about 23% - much closer to the real \(p\).
The following graph shows the statistical significance corresponding to the number of tests we do after fixing the bug:
We can see in that graph that after about 11 tests we reach 95%, and after about 16 we get to 99%. As we have tested more before fixing the bug, we found a higher \(p_{min}\), and that allowed us to test less after fixing the bug.
We have seen that we decrease \(t\) as we increase \(n\), as that can potentially increases our lower estimate for \(p\). Of course, that value can decrease as we test, but that means that we "got lucky" in the first trials and we are getting to know the bug better - the estimate is approaching the real value in a non-deterministic way, after all.
But, how much should we test before fixing the bug? Which value is an ideal value for \(n\)?
To define an optimal value for \(n\), we will minimize the sum \(n+t\). This objective gives us the benefit of minimizing the total amount of testing without compromising our guarantees. Minimizing the testing can be fundamental if each test costs significant time and/or resources.
The graph bellow shows us the evolution of the value of \(t\) and \(t+n\) using the data we generated for our bug:
We can see clearly that there are some low values of \(n\) and \(t\) that give us the guarantees we need. Those values are \(n = 15\) and \(t = 24\), which gives us \(t+n = 39\).
While you can use this technique to minimize the total number of tests performed (even more so when testing is expensive), testing more is always a good thing, as it always improves our guarantee, be it in \(n\) by providing us with a better \(p\) or in \(t\) by increasing the statistical significance of the conclusion that the bug is fixed. So, before fixing the bug, test until you see the bug at least once, and then at least the amount specified by this technique - but also test more if you can, there is no upper bound, specially after fixing the bug. You can then report a higher confidence in the solution.
When a programmer finds a bug that behaves in a non-deterministic way, he knows he should test enough to know more about the bug, and then even more after fixing it. In this article we have presented a framework that provides criteria to define numerically how much testing is "enough" and "even more." The same technique also provides a method to objectively measure the guarantee that the amount of testing performed provides, when it is not possible to test "enough."
We have also provided a real example (even though the bug itself is artificial) where the framework is applied.
As usual, the source code of this page (R scripts, etc) can be found and downloaded in https://github.com/lpenz/lpenz.github.io
## File
file <- "myfile.txt"
## Create connection
con <- file(description=file, open="r")
## Hopefully you know the number of lines from some other source or
com <- paste("wc -l ", file, " | awk '{ print $1 }'", sep="")
n <- system(command=com, intern=TRUE)
## Loop over a file connection
for(i in 1:n) {
tmp <- scan(file=con, nlines=1, quiet=TRUE)
## do something on a line of data
}
by Unknown (noreply@blogger.com) at December 01, 2013 10:55 PM
ped <- data.frame( id=c( 1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
fid=c( NA, NA, 2, 2, 4, 2, 5, 5, NA, 8),
mid=c( NA, NA, 1, NA, 3, 3, 6, 6, NA, 9))
## install.packages(pkgs="pedigreemm")
library(package="pedigreemm")
ped2 <- with(ped, pedigree(sire=fid, dam=mid, label=id))
U <- relfactor(ped2)
A <- crossprod(U)
round(U, digits=2)
## 10 x 10 sparse Matrix of class "dtCMatrix"
## [1,] 1 . 0.50 . 0.25 0.25 0.25 0.25 . 0.12
## [2,] . 1 0.50 0.50 0.50 0.75 0.62 0.62 . 0.31
## [3,] . . 0.71 . 0.35 0.35 0.35 0.35 . 0.18
## [4,] . . . 0.87 0.43 . 0.22 0.22 . 0.11
## [5,] . . . . 0.71 . 0.35 0.35 . 0.18
## [6,] . . . . . 0.71 0.35 0.35 . 0.18
## [7,] . . . . . . 0.64 . . .
## [8,] . . . . . . . 0.64 . 0.32
## [9,] . . . . . . . . 1 0.50
## [10,] . . . . . . . . . 0.66
## To check
U - chol(A)
round(A, digits=2)
## 10 x 10 sparse Matrix of class "dsCMatrix"
## [1,] 1.00 . 0.50 . 0.25 0.25 0.25 0.25 . 0.12
## [2,] . 1.00 0.50 0.50 0.50 0.75 0.62 0.62 . 0.31
## [3,] 0.50 0.50 1.00 0.25 0.62 0.75 0.69 0.69 . 0.34
## [4,] . 0.50 0.25 1.00 0.62 0.38 0.50 0.50 . 0.25
## [5,] 0.25 0.50 0.62 0.62 1.12 0.56 0.84 0.84 . 0.42
## [6,] 0.25 0.75 0.75 0.38 0.56 1.25 0.91 0.91 . 0.45
## [7,] 0.25 0.62 0.69 0.50 0.84 0.91 1.28 0.88 . 0.44
## [8,] 0.25 0.62 0.69 0.50 0.84 0.91 0.88 1.28 . 0.64
## [9,] . . . . . . . . 1.0 0.50
## [10,] 0.12 0.31 0.34 0.25 0.42 0.45 0.44 0.64 0.5
1.
0
0
## install.packages(pkgs="bdsmatrix")
library(package="bdsmatrix")
tmp <- gchol(as.matrix(A))
D <- diag(tmp)
(T <- as(as.matrix(tmp), "dtCMatrix"))
## 10 x 10 sparse Matrix of class "dtCMatrix"
## [1,] 1.000 . . . . . . . . .
## [2,] . 1.0000 . . . . . . . .
## [3,] 0.500 0.5000 1.00 . . . . . . .
## [4,] . 0.5000 . 1.000 . . . . . .
## [5,] 0.250 0.5000 0.50 0.500 1.00 . . . . .
## [6,] 0.250 0.7500 0.50 . . 1.00 . . . .
## [7,] 0.250 0.6250 0.50 0.250 0.50 0.50 1 . . .
## [8,] 0.250 0.6250 0.50 0.250 0.50 0.50 . 1.0 . .
## [9,] . . . . . . . . 1.0 .
## [10,] 0.125 0.3125 0.25 0.125 0.25 0.25 . 0.5 0.5 1
## To chec
k
L
&
lt;
- T %*% diag(sqrt(D))
L - t(U)
(TInv <- as(ped2, "sparseMatrix"))
## 10 x 10 sparse Matrix of class "dtCMatrix" (unitriangular)
## 1 1.0 . . . . . . . . .
## 2 . 1.0 . . . . . . . .
## 3 -0.5 -0.5 1.0 . . . . . . .
## 4 . -0.5 . 1.0 . . . . . .
## 5 . . -0.5 -0.5 1.0 . . . . .
## 6 . -0.5 -0.5 . . 1.0 . . . .
## 7 . . . . -0.5 -0.5 1 . . .
## 8 . . . . -0.5 -0.5 . 1.0 . .
## 9 . . . . . . . . 1.0 .
## 10 . . . . . . . -0.5 -0.5 1
round(DInv <- Diagonal(x=1/Dmat(ped2)), digits=2)
## 10 x 10 diagonal matrix of class "ddiMatrix"
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 . . . . . . . . .
## [2,] . 1 . . . . . . . .
## [3,] . . 2 . . . . . . .
## [4,] . . . 1.33 . . . . . .
## [5,] . . . . 2 . . . . .
## [6,] . . . . . 2 . . . .
## [7,] . . . . . . 2.46 . . .
## [8,] . . . . . . . 2.46 . .
## [9,] . . . . . . . . 1 .
## [10,] . . . . . . . . . 2.33
round(t(TInv) %*% DInv %*% TInv, digits=2)
## 10 x 10 sparse Matrix of class "dgCMatrix"
## ...
round(crossprod(sqrt(DInv) %*% TInv), digits=2)
## 10 x 10 sparse Matrix of class "dsCMatrix"
## [1,] 1.5 0.50 -1.0 . . . . . . .
## [2,] 0.5 2.33 -0.5 -0.67 . -1.00 . . . .
## [3,] -1.0 -0.50 3.0 0.50 -1.00 -1.00 . . . .
## [4,] . -0.67 0.5 1.83 -1.00 . . . . .
## [5,] . . -1.0 -1.00 3.23 1.23 -1.23 -1.23 . .
## [6,] . -1.00 -1.0 . 1.23 3.23 -1.23 -1.23 . .
## [7,] . . . . -1.23 -1.23 2.46 . . .
## [8,] . . . . -1.23 -1.23 . 3.04 0.58 -1.16
## [9,] . . . . . . . 0.58 1.58 -1.16
## [10,] . . . . . . . -1.16 -1.16
2
.3
3
#
# T
o c
heck
so
l
ve
(A
) - crossprod(sqrt(DInv) %*% TInv)
by Unknown (noreply@blogger.com) at August 13, 2013 02:28 PM
## Collect arguments
args <- commandArgs(TRUE)
## Default setting when no arguments passed
if(length(args) < 1) {
args <- c("--help")
}
## Help section
if("--help" %in% args) {
cat("
The R Script
Arguments:
--arg1=someValue - numeric, blah blah
--arg2=someValue - character, blah blah
--arg3=someValue - logical, blah blah
--help - print this text
Example:
./test.R --arg1=1 --arg2="output.txt" --arg3=TRUE \n\n")
q(save="no")
}
## Parse arguments (we expect the form --arg=value)
parseArgs <- function(x) strsplit(sub("^--", "", x), "=")
argsDF <- as.data.frame(do.call("rbind", parseArgs(args)))
argsL <- as.list(as.character(argsDF$V2))
names(argsL) <- argsDF$V1
## Arg1 default
if(is.null(args$arg1)) {
## do something
}
## Arg2 default
if(is.null(args$arg2)) {
## do something
}
## Arg3 default
if(is.null(args$arg3)) {
## do something
}
## ... your code here ...
This blog is moving to blog.r-enthusiasts.com. The new one is powered by wordpress and gets a subdomain of r-enthusiasts.com
.
See you there