# Planet R

## December 11, 2013

### CRANberries

#### New package rpartitions with initial version 0.1

Package: rpartitions
Title: Code for integer partitioning
Description: Provides algorithms for randomly sampling a feasible set defined by a given total and number of elements using integer partitioning.
Version: 0.1
Author: Ken Locey, Daniel McGlinn
Maintainer: Daniel McGlinn
Depends: R (>= 2.15.1), hash
Suggests: testthat (>= 0.2)
NeedsCompilation: yes
Repository: CRAN
URL: https://github.com/klocey/partitions
LazyData: true
Collate: 'rpartitions.R' 'rpartitions-package.r'
Date/Publication: 2013-12-11 07:34:59
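
The package's own function names are not shown in the metadata above, so as a hedged illustration of the underlying idea only (uniform sampling of integer partitions via partition counts; all names below are mine, not rpartitions' API), here is a Python sketch:

```python
from functools import lru_cache
import random

@lru_cache(maxsize=None)
def npartitions(n, k):
    # Number of partitions of n into exactly k positive parts:
    # P(n, k) = P(n-1, k-1) + P(n-k, k).
    if k == 0:
        return 1 if n == 0 else 0
    if n < k:
        return 0
    return npartitions(n - 1, k - 1) + npartitions(n - k, k)

def rand_partition(n, k, rng=random):
    # Uniform random partition of n into exactly k parts (1 <= k <= n).
    parts, base = [], 0   # base = number of 1s subtracted from every remaining part
    while k > 0:
        if rng.randrange(npartitions(n, k)) < npartitions(n - 1, k - 1):
            parts.append(1 + base)   # smallest remaining part is peeled off
            n, k = n - 1, k - 1
        else:
            n -= k                   # subtract 1 from each remaining part
            base += 1
    return sorted(parts, reverse=True)
```

Each call returns one member of the feasible set for the given total and number of parts; for example `rand_partition(10, 3)` might yield `[5, 4, 1]`.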

### Removed CRANberries

#### Package DataFrameConstr (with last version 1.1-2) was removed from CRAN

Previous versions (as known to CRANberries) which should be available via the Archive link are:

2013-03-20 1.1-2
2013-03-19 1.1-0

### Revolutions

#### In case you missed it: November 2013 Roundup

In case you missed them, here are some articles from November of particular interest to R users:

A recap of the Strata Hadoop World conference, including presentations from Monsanto and eHarmony on their use of R.

How to rank and chart the most frequent hashtags of Twitter users with R.

A replay of the What’s new in Revolution R Enterprise 7 webinar.

Using Plotly’s new interface with R, plus reports from R user groups.

An analysis of World Series Baseball strikeout rates using R.

CRAN surpasses the milestone of 5,000 R packages, thanks to the volunteers who maintain the system.

A detailed guide to memory usage in R, from Hadley Wickham.

Running R inside Teradata Database with Revolution R Enterprise (webinar replay).

The Human Rights Data Analysis Group uses R to estimate the number of casualties in Syria.

Data Scientists coming from computer science backgrounds can learn from the small-data techniques of Statistics — even with Big Data.

A tutorial on using iterators in R.

A Princeton University guide translates common Stata commands into R code.

Joseph Rickert surveys the available packages for Bayesian analysis with R.

In an interview with DataInformed, I discussed the rise of R as the lingua franca of data science, and how the Big Data revolution has led companies to adopt statistical decision making.

DataMind’s www.r-fiddle.org is an online scratchpad GUI for R programmers.

R connections to startups Domino, Plotly and Quandl.

A Thanksgiving greeting from Revolution Analytics (with an assist from the cowsay package).

Some non-R stories in the past month included: Virgin's safety dance, a mouse perseveres, a sonification of sorting algorithms, a marked rise in data scientist job postings, real-life Mario and comet ISON rounds the sun.

As always, thanks for the comments and please send any suggestions to me at david@revolutionanalytics.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.

## December 10, 2013

### CRANberries

#### New package binomSamSize with initial version 0.1-3

Package: binomSamSize
Type: Package
Title: Confidence intervals and sample size determination for a binomial proportion under simple random sampling and pooled sampling
Version: 0.1-3
Date: 2013-12-10
Author: Michael Hoehle with contributions by Wei Liu
Maintainer: Michael Hoehle
Depends: binom
Description: A suite of functions to compute confidence intervals and necessary sample sizes for the parameter p of the Bernoulli B(p) distribution under simple random sampling or under pooled sampling. Such computations are e.g. of interest when investigating the incidence or prevalence in populations. The package contains functions to compute coverage probabilities and coverage coefficients of the provided confidence intervals procedures. Sample size calculations are based on expected length.
Encoding: latin1
Packaged: 2013-12-10 21:29:01 UTC; hoehle
NeedsCompilation: yes
Repository: CRAN
Date/Publication: 2013-12-10 23:04:10
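
For readers who want the flavor of such computations without the package, here is a hedged sketch (in Python purely for illustration; binomSamSize's actual function names and methods are not shown here) of one classic interval and a Wald-type sample-size rule based on expected half-width:

```python
import math

def wilson_ci(x, n, z=1.96):
    # Wilson score interval for a binomial proportion (default ~95%).
    p = x / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def wald_sample_size(p, d, z=1.96):
    # Smallest n whose expected Wald half-width at proportion p is <= d.
    return math.ceil(z * z * p * (1 - p) / (d * d))
```

For instance, bounding the half-width at 0.05 for a proportion near 0.5 requires 385 subjects under the Wald rule.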

#### New package gsscopu with initial version 0.9-1

Package: gsscopu
Version: 0.9-1
Date: 2013-12-9
Title: Copula Density and 2-D Hazard Estimation using Smoothing Splines
Author: Chong Gu
Maintainer: Chong Gu
Depends: R (>= 2.14.0), gss (>= 2.1-0)
Description: A collection of routines for the estimation of copula density and 2-D hazard function using smoothing splines.
Packaged: 2013-12-09 15:17:51 UTC; chong
NeedsCompilation: yes
Repository: CRAN
Date/Publication: 2013-12-10 16:13:19

### Alstatr

#### R: Explore ARIMA(2, 2, 2) subclass family on Shiny

I've been thinking that it might be better to explore the Box-Jenkins ARIMA (Autoregressive Integrated Moving-Average) three-stage iterative modelling on Shiny. So here is what I've got; this app is intended for the ARIMA(2, 2, 2) subclass family only.

The app has six tabs, and these are:
1. Historical Plot;
2. Identification;
3. Estimation;
4. Diagnostic;
5. Forecast; and
6. Data

The first tab shows the time plot of the simulated time series; the series can be simulated from the different subclass families of ARIMA(2, 2, 2). The order is assigned using the controls in the side panel, and the values of the parameters are set in the text fields right below the plot. For example, ARIMA(1, 1, 1) has two text fields, for the AR (Autoregressive) and MA (Moving-Average) parameters, as shown below. The default parameter for all models is 0.3.
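
The simulation behind such a tab can be sketched outside Shiny as well; below is a minimal Python illustration (names and defaults mine) for an ARIMA(1, 1, 1) series with the app's default parameter 0.3: generate ARMA(1, 1) values, then integrate once by cumulative summation.

```python
import random

def simulate_arima_111(n, phi=0.3, theta=0.3, seed=42):
    # ARMA(1,1): y_t = phi * y_{t-1} + eps_t + theta * eps_{t-1}
    rng = random.Random(seed)
    y_prev, eps_prev = 0.0, 0.0
    arma = []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        y = phi * y_prev + eps + theta * eps_prev
        arma.append(y)
        y_prev, eps_prev = y, eps
    # One order of integration (d = 1): cumulative sum of the ARMA values.
    series, total = [], 0.0
    for v in arma:
        total += v
        series.append(total)
    return series
```

In R the same thing is a one-liner with `arima.sim`; the sketch just makes the recursion explicit.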

The next tab is Identification, the first stage of Box-Jenkins iterative modelling. Here, the model is identified using the correlograms: the Autocorrelation Function (ACF) and the Partial Autocorrelation Function (PACF).

Inside this tab are three subtabs: ACF, PACF, and Differenced Series. The third subtab (Differenced Series) shows nothing unless you apply differencing (by checking the Apply Differencing? check box). After identifying the model, we proceed to the second stage of modelling, which is Estimation.

The side panel here changes: the range of both the Autoregressive and Moving-Average sliders is now from 0 to 5. I extended this so that we can play with the fitted values from different combinations of the model's orders. The order of differencing, of course, remains the same. If you prefer to include a constant (intercept) in the model, you can check the box. Note that the intercept is ignored if there is differencing in the model (from the arima R documentation).

After obtaining the estimates, let's diagnose the model by checking the randomness of the residuals. So here is the main panel of the Diagnostic tab,

As you'll notice, I made a simple hypothesis test that is dynamic. The values in the COMPUTATION section update automatically if we change the order of the identified model. The same goes for the last two sections (DECISION and CONCLUSION). You will see how it works especially when we deviate from the identified order of the model: say we simulate ARIMA(1, 1, 0) but assign the identified order as ARIMA(0, 0, 1); then we will have something like this,

The p-value is less than 0.05, thus the decision is to reject the null hypothesis, which concludes that the residuals are not independently distributed. And if we look at the time plot of the residuals, we see a pattern in it that isn't random, which agrees with the conclusion. On this tab, you will also find a new option for a histogram, an idea from Shiny's basic examples.
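
The randomness check described here is typically a portmanteau test such as Ljung-Box (`Box.test` in R). As a hedged sketch of the statistic itself (Python, stdlib only; names mine), without the chi-squared p-value lookup:

```python
def acf(x, lag):
    # Sample autocorrelation of the series x at the given lag.
    n, m = len(x), sum(x) / len(x)
    c0 = sum((v - m) ** 2 for v in x)
    return sum((x[t] - m) * (x[t - lag] - m) for t in range(lag, n)) / c0

def ljung_box(x, lags=10):
    # Q = n(n+2) * sum_k rho_k^2 / (n - k); a large Q means the
    # residuals are not independently distributed.
    n = len(x)
    return n * (n + 2) * sum(acf(x, k) ** 2 / (n - k) for k in range(1, lags + 1))
```

Q is compared against a chi-squared quantile (about 18.31 at the 5% level with 10 lags); strongly patterned residuals blow well past it.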

Finally, we compute the predicted values and plot this along with the original data.

And as mentioned, the sixth tab is the data.

### CRANberries

#### New package VideoComparison with initial version 0.9-4

Package: VideoComparison
Version: 0.9-4
Date: 2013-12-09
Title: Video comparison tool
Author: Silvia Espinosa, Joaquin Ordieres, Antonio Bello, Jose Maria Perez
Maintainer: Joaquin Ordieres
Depends: R (>= 2.15.2), RJSONIO, RCurl, zoo, stats, pracma, Rcpp (>= 0.10.3)
Suggests: MASS
Description: It takes the vectors of motion for two videos (coming, for instance, from a variant of the shotdetect code that stores detailed motion vectors in JSON format) and compares them to extract the common chunk. Then, provided you have some image hashes, it compares their signatures in order to decide on the chunk similarity of the two video files.
URL: http://www.r-project.org
Packaged: 2013-12-09 08:58:34 UTC; jmperez
NeedsCompilation: yes
Repository: CRAN
Date/Publication: 2013-12-10 13:41:44

#### New package pathmox with initial version 0.2.0

Package: pathmox
Type: Package
Title: Pathmox Approach of Segmentation Trees in Partial Least Squares Path Modeling
Version: 0.2.0
Date: 2013-12-09
Authors@R: c( person("Gaston", "Sanchez", email = "gaston.stat@gmail.com", role=c("aut", "cre")), person("Tomas", "Aluja", role=c("aut", "ths")))
Description: pathmox, the cousin package of plspm, provides a very interesting solution for handling segmentation variables in PLS Path Modeling: segmentation trees in PLS Path Modeling.
URL: http://www.gastonsanchez.com
Depends: R (>= 3.0.1), plspm, tester
Suggests: plsdepot
Collate: 'fix.pathmox.R' 'fix.techmox.R' 'pathmox-internal.R' 'pathmox.R' 'plot.bootnodes.R' 'plot.treemox.R' 'plot.treemox.pls.R' 'plspmox.R' 'print.bootnodes.R' 'print.treemox.R' 'print.treemox.pls.R' 'techmox.R' 'treemox.boot.R' 'treemox.pls.R' 'get_fix_xexeloa.R' 'get_nominal_split.R' 'get_ordinal_split.R' 'get_pls_basic.R' 'get_xexeloa.R' 'test_sanchez_aluja.R' 'check_pathmox_args.r'
Packaged: 2013-12-09 18:32:14 UTC; Gaston
Author: Gaston Sanchez [aut, cre], Tomas Aluja [aut, ths]
Maintainer: Gaston Sanchez
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2013-12-10 13:34:48

#### New package glmx with initial version 0.1-0

Package: glmx
Version: 0.1-0
Date: 2013-12-10
Title: Generalized Linear Models Extended
Authors@R: c(person(given = "Achim", family = "Zeileis", role = c("aut", "cre"), email = "Achim.Zeileis@R-project.org"), person(given = "Roger", family = "Koenker", role = "aut"), person(given = "Philipp", family = "Doebler", role = "aut"))
Description: Extended techniques for generalized linear models (GLMs), especially for binary responses, including parametric links and heteroskedastic latent variables.
Depends: R (>= 2.14.0)
Imports: stats, MASS, Formula, lmtest, sandwich
Suggests: AER, gld, numDeriv, pscl
Packaged: 2013-12-10 11:40:00 UTC; zeileis
Author: Achim Zeileis [aut, cre], Roger Koenker [aut], Philipp Doebler [aut]
Maintainer: Achim Zeileis
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2013-12-10 13:11:15

### Dirk Eddelbuettel

A new Armadillo release 3.930 came out a few days ago, with a very nice set of changes (see below). I rolled this into RcppArmadillo 0.3.930.0. However, one of these changes revealed that R shipped only the standard SVD for complex-valued matrices, and not the more performant divide-and-conquer approach. So in R builds using the default built-in Lapack, at least one CRAN package no longer built.

After some back and forth, Conrad put some branching in the library to fall back to the standard SVD, and I added a build-time configuration test for an appropriate preprocessor directive used by the fallback code. This is now on CRAN and in Debian as RcppArmadillo release 0.3.930.1, and Conrad will probably update the Armadillo page as well (though the fix is only needed with R's builtin Rlapack). Also of note: R Core has already added the missing Fortran routine zgesdd to R 3.1.0 (aka "R-devel"), so this issue goes away with the next release. Finally, I wrote up a short Rcpp Gallery post illustrating the performance gains available from divide-and-conquer SVD.

The complete list of changes is below.

#### Changes in RcppArmadillo version 0.3.930.1 (2013-12-09)

• Armadillo falls back to standard complex svd if the more performant divide-and-conquer variant is unavailable

• Added detection for Lapack library and distinguish between R's own version (without zgesdd) and system Lapack; a preprocessor define is set accordingly

#### Changes in RcppArmadillo version 0.3.930.0 (2013-12-06)

• added divide-and-conquer variant of svd_econ(), for faster SVD

• added divide-and-conquer variant of pinv(), for faster pseudo-inverse

• added element-wise variants of min() and max()

• added size() based specifications of submatrix view sizes

• added randi() for generating matrices with random integer values

• added more intuitive specification of sort direction in sort() and sort_index()

• added more intuitive specification of method in det(), .i(), inv() and solve()

• added more precise timer for the wall_clock class when using C++11

• New unit tests for complex matrices and vectors

Courtesy of CRANberries, there is also a diffstat report for the most recent release. As always, more detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

### CRANberries

#### New package ICGE with initial version 0.2

Package: ICGE
Version: 0.2
Date: 2011-12-21
Title: Estimation of number of clusters and identification of atypical units
Authors@R: c(person("Itziar", "Irigoien", role = c("aut", "cre"), email = "itziar.irigoien@ehu.es"), person("Concepcion", "Arenas", role = "aut", email="carenas@ub.edu"))
Author: Itziar Irigoien and Conchita Arenas
Maintainer: Itziar Irigoien
Depends: R (>= 2.0.1), MASS, utils, stats, cluster
Description: ICGE is a package that helps to estimate the number of real clusters in data as well as to identify atypical units. The underlying methods are based on distances rather than on unit x variables.
LazyData: yes
Packaged: 2013-12-10 10:58:01 UTC; itziar
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2013-12-10 12:34:50

### Removed CRANberries

#### Package clues (with last version 0.5-0) was removed from CRAN

Previous versions (as known to CRANberries) which should be available via the Archive link are:

2010-02-03 0.5-0
2009-12-24 0.4.0
2009-11-17 0.3.9
2009-03-17 0.3.2
2009-02-16 0.3.1
2009-02-12 0.2.9

#### Package bayesGARCH (with last version 1-00.10) was removed from CRAN

Previous versions (as known to CRANberries) which should be available via the Archive link are:

2011-04-15 1-00.10
2011-02-10 1-00.09
2011-01-04 1-00.08
2010-08-31 1-00.06
2009-11-25 1-00.05
2009-05-11 1-00.04
2008-12-04 1-00.02
2008-06-03 1-00.01

#### Package AcceptanceSampling (with last version 1.0-2) was removed from CRAN

Previous versions (as known to CRANberries) which should be available via the Archive link are:

2012-07-02 1.0-2

#### Package rsae (with last version 0.1-4) was removed from CRAN

Previous versions (as known to CRANberries) which should be available via the Archive link are:

2012-01-08 0.1-4
2011-07-25 0.1-3
2011-07-23 0.1-2

#### Package AdMit (with last version 1-01.06) was removed from CRAN

Previous versions (as known to CRANberries) which should be available via the Archive link are:

2012-09-08 1-01.06
2012-08-21 1-01.04
2011-02-18 1-01.03.1
2009-08-20 1-01.03
2009-05-11 1-01.02
2009-01-25 1-01.01
2008-12-01 1-00.04
2008-06-11 1-00.02

#### Package rmac (with last version 0.9) was removed from CRAN

Previous versions (as known to CRANberries) which should be available via the Archive link are:

2010-08-31 0.9

#### Package int64 (with last version 1.1.2) was removed from CRAN

Previous versions (as known to CRANberries) which should be available via the Archive link are:

2011-12-02 1.1.2
2011-11-28 1.1.1
2011-11-24 1.1.0
2011-11-20 1.0.0

#### Package probemapper (with last version 1.0.0) was removed from CRAN

Previous versions (as known to CRANberries) which should be available via the Archive link are:

2011-06-19 1.0.0

#### Package ToxLim (with last version 1.0) was removed from CRAN

Previous versions (as known to CRANberries) which should be available via the Archive link are:

2010-03-26 1.0

### CRANberries

#### New package corclass with initial version 0.1

Package: corclass
Type: Package
Title: Correlational Class Analysis
Version: 0.1
Date: 2013-11-25
Author: Andrei Boutyline
Maintainer: Andrei Boutyline
Description: Perform a correlational class analysis of the data, resulting in a partition of the data into separate modules.
Depends: igraph
Suggests: Cairo
Packaged: 2013-12-10 07:04:09 UTC; ElectricShhh
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2013-12-10 08:36:54

### Revolutions

#### The impact of Big Data on video gamers

Because video games happen in a virtual world, it's possible to measure just about every aspect of the game. It's kind of like being able to observe a sports match or a battle, but being able to attach a telemetry sensor to every player, every weapon and bullet, every surface of the environment, and gather all that data in real time. The Big Data revolution has made this possible, and video game companies routinely gather 50 terabytes of data per day to improve their games, operations and revenue.

But from the player's point of view, can analyzing this data improve their performance? Just as Moneyball revolutionized baseball, can analyzing video game data improve the success of a professional gamer? Video gaming magazine Kill Screen asked how big data helps and hurts League of Legends players in a recent article and suggested it can help many players:

“Almost all players will reach a point where they will plateau without self-reflection, analysis, and focused practice,” Sabine Hemmi of the League of Legends stat site Elobuff tells me. “Any player who understands the basics can learn from statistics. It will be easier for them to identify their weaknesses and focus on improving.”

But when it comes to the elite players, the numbers may not be as much help:

“If you go to a local chess club and pick a low-level player, it’s easy to spot flaws in their game,” explains Bill Grosso, CEO of Scientific Revenue. “Then, you go to the world championship of chess. It becomes really, really hard.”

Nonetheless, it's clear that data analysis is revolutionizing the video game industry, as Bill Grosso described in a recent Revolution Analytics webinar, "Knowing How People are Playing Your Game Gives You the Winning Hand". Follow that link for the webinar replay, or click the link below for the full Kill Screen article.

## December 09, 2013

### CRANberries

#### New package kyotil with initial version 2013.12-9

Package: kyotil
LazyData: yes
Version: 2013.12-9
Date:
Title: Utility Functions by Youyi & Krisz
Author: Youyi Fong , Krisztian Sebestyen
Maintainer: Youyi Fong
Depends: R (>= 3.0.0)
Imports:
Suggests: RUnit, knitr, lme4, nlme, xtable, MASS
VignetteBuilder: knitr
Description: A miscellaneous set of functions for printing, plotting, kernels, etc.
Packaged: 2013-12-09 17:42:05 UTC; yfong
NeedsCompilation: yes
Repository: CRAN
Date/Publication: 2013-12-09 21:15:13

#### New package CORM with initial version 1.0.0

Package: CORM
Type: Package
Version: 1.0.0
Description: We proposed a new model-based clustering method-the clustering of regression models method, which groups genes that share a similar relationship to the covariate(s). This method provides a unified approach for a family of clustering procedures and can be applied for data collected with various experimental designs. This package includes the implementation for two such clustering procedures: (1) clustering of linear models (CLM), and (2) clustering of linear mixed models (CLMM).
Title: Clustering of Regression Models Method
Authors@R: c(person("Li-Xuan","Qin",role=c("aut"),email="qinl@mskcc.org"), person("Jiejun","Shi",role=c("cre"),email="shi.abraham.2010@gmail.com"))
Depends: R (>= 2.10.0), cluster
Suggests: MASS
URL: http://www.r-project.org, http://www.mskcc.org/mskcc/html/60448.cfm
Packaged: 2013-12-09 16:05:06 UTC; abraham
Author: Li-Xuan Qin [aut], Jiejun Shi [cre]
Maintainer: Jiejun Shi
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2013-12-09 21:12:21

### Revolutions

#### 14 Analytics Predictions for 2014

In a live webinar today hosted by Alteryx, five industry experts shared 14 analytics predictions for 2014. The panel included Paul Ross (Alteryx), Charles Zedlewski (Cloudera), Rick Schultz (Alteryx), Ellie Fields (Tableau) and Michele Chambers (Revolution Analytics). Their predictions were:

1. Analysts will matter more than data scientists
2. R will replace legacy SAS solutions and go mainstream
3. Big Data will bring its "A game" in sports marketing
4. Hadoop moves from curiosity to critical
5. Gartner's prediction that the line-of-business will drive analytics spend will happen
6. Visual analytics continues to grow but users need more
7. Analysts' lives get more complex, but also easier
8. Predictive analytics will no longer be a specialist subject
9. Customer analytics is the next big marketing role
10. A new analytics stack will emerge
11. Location meets big data analytics
12. NoSQL meets analytics

There was some great discussion amongst the panelists (you can register for the replay here) but I'd say there were two major themes amongst the predictions. Firstly, that next-generation platforms for analytics have matured and are displacing legacy systems (#2, #4, #10 and #12) — the growth of R and Hadoop, especially, have played a big part in this. Secondly, that the "last mile problem" of connecting business analysts with the predictive insights they need is being solved, and is empowering many more decision makers (#1, #3, #7, #8, #9) — in other words, we have evolved beyond BI and are making predictive analytics a key part of the operational process. If I were to take a contrarian view of the discussion, it would be that I felt it downplayed the role of data scientists a little too much: as I see it, data scientists will continue to have a critical role in every organization, understanding business problems and the data sources, computing platforms, and statistical techniques that can be deployed to solve them. But I wholeheartedly agree that in 2014 data scientists will be providing insights to a rapidly growing number of business analysts, by integrating their work into tools like Alteryx and Tableau.

You can see the slides from the #14for14 presentation above, and register for the replay at the link below.

Alteryx webinars: 14 for 14: Analytics Predictions for 2014

### Removed CRANberries

#### Package DPM.GGM (with last version 0.14) was removed from CRAN

Previous versions (as known to CRANberries) which should be available via the Archive link are:

2011-11-25 0.14
2011-11-23 0.13

### RCpp Gallery

#### Performance of the divide-and-conquer SVD algorithm

The ubiquitous LAPACK library provides several implementations for the singular-value decomposition (SVD). We will illustrate possible speed gains from using the divide-and-conquer method by comparing it to the base case.

#include <RcppArmadillo.h>

// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
arma::vec baseSVD(const arma::mat & X) {
    arma::mat U, V;
    arma::vec S;
    arma::svd(U, S, V, X, "standard");
    return S;
}

// [[Rcpp::export]]
arma::vec dcSVD(const arma::mat & X) {
    arma::mat U, V;
    arma::vec S;
    arma::svd(U, S, V, X, "dc");
    return S;
}


Having the two implementations, which differ only in the method argument (added recently in Armadillo 3.930), we are ready to do a simple timing comparison.

library(microbenchmark)
set.seed(42)
X <- matrix(rnorm(16e4), 4e2, 4e2)
microbenchmark(baseSVD(X), dcSVD(X))

Unit: milliseconds
expr   min    lq median    uq   max neval
baseSVD(X) 421.2 422.6  424.2 426.2 442.1   100
dcSVD(X) 111.0 111.5  111.9 113.6 126.1   100


The speed gain is noticeable, with a ratio of about 3.9 at the median. However, we can also look at complex-valued matrices.

// [[Rcpp::export]]
arma::vec cxBaseSVD(const arma::cx_mat & X) {
    arma::cx_mat U, V;
    arma::vec S;
    arma::svd(U, S, V, X, "standard");
    return S;
}

// [[Rcpp::export]]
arma::vec cxDcSVD(const arma::cx_mat & X) {
    arma::cx_mat U, V;
    arma::vec S;
    arma::svd(U, S, V, X, "dc");
    return S;
}

A <- matrix(rnorm(16e4), 4e2, 4e2)
B <- matrix(rnorm(16e4), 4e2, 4e2)
X <- A + 1i * B
microbenchmark(cxBaseSVD(X), cxDcSVD(X))

Unit: milliseconds
expr    min     lq median     uq    max neval
cxBaseSVD(X) 1248.7 1253.7 1257.5 1262.3 1311.7   100
cxDcSVD(X)  259.2  259.8  260.5  263.2  327.9   100


Here the difference is even more pronounced, at about 4.8. However, it is precisely this complex-valued divide-and-conquer method which is missing in R’s own Lapack. So R built with the default configuration cannot currently take advantage of the complex-valued divide-and-conquer algorithm. Only builds which use an external Lapack library (as for example the Debian and Ubuntu builds) can. Let’s hope that R will add this functionality to its next release R 3.1.0. Update: And the underlying zgesdd routine has now been added to the upcoming R 3.1.0 release. Nice.

### CRANberries

#### New package geoCount with initial version 1.131209

Package: geoCount
Type: Package
Title: Analysis and Modeling for Geostatistical Count Data
Version: 1.131209
Date: 2013-12-09
Author: Liang Jing
Maintainer: Liang Jing
Description: This package provides a variety of functions to analyze and model geostatistical count data with generalized linear spatial models, including 1) simulate and visualize the data; 2) posterior sampling with robust MCMC algorithms (in serial or parallel way); 3) perform prediction for unsampled locations; 4) conduct Bayesian model checking procedure to evaluate the goodness of fitting; 5) conduct transformed residual checking procedure. In the package, seamlessly embedded C++ programs and parallel computing techniques are implemented to speed up the computing processes.
Depends: R (>= 2.12.0), Rcpp (>= 0.9.4), RcppArmadillo (>= 0.2.19)
Suggests: coda, distrEx, reldist, snow, snowfall
Packaged: 2013-12-09 07:11:43 UTC; jingl
NeedsCompilation: yes
Repository: CRAN
Date/Publication: 2013-12-09 09:03:04

### Leandro Penz

#### Probabilistic bug hunting

Have you ever run into a bug that, no matter how carefully you try to reproduce it, only happens sometimes? And then you think you've finally got it and solved it - and tested a couple of times without any manifestation. How do you know that you have tested enough? Are you sure you were not "lucky" in your tests?

In this article we will see how to answer those questions and the math behind it without going into too much detail. This is a pragmatic guide.

# The Bug

The following program is supposed to generate two random 8-bit integers and print them on stdout:


#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns -1 if error, other number if ok. */
int get_random_chars(char *r1, char *r2)
{
    int f = open("/dev/urandom", O_RDONLY);

    if (f < 0)
        return -1;
    if (read(f, r1, sizeof(*r1)) < 0)
        return -1;
    if (read(f, r2, sizeof(*r2)) < 0)
        return -1;
    close(f);

    return *r1 & *r2;
}

int main(void)
{
    char r1;
    char r2;
    int ret;

    ret = get_random_chars(&r1, &r2);

    if (ret < 0)
        fprintf(stderr, "error");
    else
        printf("%d %d\n", r1, r2);

    return ret < 0;
}



On my architecture (Linux on IA-32) it has a bug that makes it print "error" instead of the numbers sometimes.

# The Model

Every time we run the program, the bug can either show up or not. It has a non-deterministic behaviour that requires statistical analysis.

We will model a single program run as a Bernoulli trial, with success defined as "seeing the bug", as that is the event we are interested in. We have the following parameters when using this model:

• $$n$$: the number of tests made;
• $$k$$: the number of times the bug was observed in the $$n$$ tests;
• $$p$$: the unknown (and, most of the time, unknowable) probability of seeing the bug.

Modeling each run as a Bernoulli trial, the number of errors $$k$$ in $$n$$ runs of the program follows a binomial distribution $$k \sim B(n,p)$$. We will use this model to estimate $$p$$ and to confirm the hypothesis that the bug no longer exists, after fixing the bug in whichever way we can.
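
For concreteness, the probabilities this model assigns are easy to compute directly; a small Python illustration (the post itself uses R and Python scripts that are not reproduced here):

```python
from math import comb

def binom_pmf(k, n, p):
    # P(K = k) for K ~ B(n, p)
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# e.g. seeing the bug exactly once in four runs when p = 1/4:
# binom_pmf(1, 4, 0.25) == 0.421875, i.e. about 42% of the time.
```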

By using this model we are implicitly assuming that all our tests are performed independently and identically. In other words: if the bug happens more often in one environment, we either test always in that environment or never; if the bug gets more and more frequent the longer the computer is running, we reset the computer after each trial. If we don't do that, we are effectively estimating the value of $$p$$ with trials from different experiments, while in truth each experiment has its own $$p$$. We will find a single value anyway, but it has no meaning and can lead us to wrong conclusions.

## Physical analogy

Another way of thinking about the model and the strategy is by creating a physical analogy with a box that has an unknown number of green and red balls:

• Bernoulli trial: taking a single ball out of the box and looking at its color - if it is red, we have observed the bug, otherwise we haven't. We then put the ball back in the box.
• $$n$$: the total number of trials we have performed.
• $$k$$: the total number of red balls seen.
• $$p$$: the total number of red balls in the box divided by the total number of balls in the box.

• If we open the box and count the balls, we can know $$p$$, in contrast with our original problem.
• Without opening the box, we can estimate $$p$$ by repeating the trial. As $$n$$ increases, our estimate for $$p$$ improves. Mathematically: $p = \lim_{n\to\infty}\frac{k}{n}$
• Performing the trials in different conditions is like taking balls out of several different boxes. The results tell us nothing about any single box.

# Estimating $$p$$

Before we try fixing anything, we have to know more about the bug, starting by the probability $$p$$ of reproducing it. We can estimate this probability by dividing the number of times we see the bug $$k$$ by the number of times we tested for it $$n$$. Let's try that with our sample bug:

$ ./hasbug
67 -68
$ ./hasbug
79 -101
$ ./hasbug
error


We know from the source code that $$p=25%$$, but let's pretend that we don't, as will be the case with practically every non-deterministic bug. We tested 3 times, so $$k=1, n=3 \Rightarrow p \sim 33%$$, right? It would be better if we tested more, but how much more, and exactly what would be better?

## $$p$$ precision

Let's go back to our box analogy: imagine that there are 4 balls in the box, one red and three green. That means that $$p = 1/4$$. What are the possible results when we test three times?

| Red balls | Green balls | $$p$$ estimate |
|---|---|---|
| 0 | 3 | 0% |
| 1 | 2 | 33% |
| 2 | 1 | 66% |
| 3 | 0 | 100% |

The less we test, the lower our precision. Roughly, the precision of $$p$$ will be at most $$1/n$$ - in this case, 33%. That's both the step between the values we can find for $$p$$ and the minimal nonzero value for it.

Testing more improves the precision of our estimate.

## $$p$$ likelihood

Let's now approach the problem from another angle: if $$p = 1/4$$, what are the odds of seeing one error in four tests? Let's name the 4 balls as 0-red, 1-green, 2-green and 3-green:

Enumerating all the possible results of getting 4 balls out of the box gives $$4^4=256$$ rows, generated by this python script. The same script counts the number of red balls in each row, and outputs the following table:

| $$k$$ | rows | % |
|---|---|---|
| 0 | 81 | 31.64% |
| 1 | 108 | 42.19% |
| 2 | 54 | 21.09% |
| 3 | 12 | 4.69% |
| 4 | 1 | 0.39% |

That means that, for $$p=1/4$$, we see 1 red ball and 3 green balls only 42% of the time when getting out 4 balls.

What if $$p = 1/3$$ - one red ball and two green balls? We would get the following table:

| $$k$$ | rows | % |
|---|---|---|
| 0 | 16 | 19.75% |
| 1 | 32 | 39.51% |
| 2 | 24 | 29.63% |
| 3 | 8 | 9.88% |
| 4 | 1 | 1.23% |

What about $$p = 1/2$$?

| $$k$$ | rows | % |
|---|---|---|
| 0 | 1 | 6.25% |
| 1 | 4 | 25.00% |
| 2 | 6 | 37.50% |
| 3 | 4 | 25.00% |
| 4 | 1 | 6.25% |
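
The three tables above can be reproduced by brute-force enumeration, in the spirit of the python script the post mentions (this is my own sketch, not that script):

```python
from itertools import product
from collections import Counter

def draw_table(n_red, n_green, draws=4):
    # Enumerate every ordered way of drawing `draws` balls with replacement
    # and tally how many of those draws contain k red balls.
    balls = ['red'] * n_red + ['green'] * n_green
    counts = Counter(sum(b == 'red' for b in row)
                     for row in product(balls, repeat=draws))
    total = len(balls) ** draws
    return {k: (counts[k], 100.0 * counts[k] / total) for k in sorted(counts)}
```

`draw_table(1, 3)` reproduces the first table (81, 108, 54, 12, 1 rows for $$k$$ = 0..4), and `draw_table(1, 2)` and `draw_table(1, 1)` reproduce the other two.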

So, let's assume that you've seen the bug once in 4 trials. What is the value of $$p$$? You know that can happen 42% of the time if $$p=1/4$$, but you also know it can happen 39% of the time if $$p=1/3$$, and 25% of the time if $$p=1/2$$. Which one is it?

The graph below shows the discrete likelihood of getting 1 red and 3 green balls for each percentage value of $$p$$:

The fact is that, given the data, the estimate for $$p$$ follows a beta distribution, $$Beta(k+1, n-k+1) = Beta(2, 4)$$ (1). The graph below shows the probability density of $$p$$:

The R script used to generate the first plot is here; the one used for the second plot is here.

## Increasing $$n$$, narrowing down the interval

What happens when we test more? We obviously increase our precision, as it is at most $$1/n$$, as we said before - there is no way to estimate that $$p=1/3$$ when we only test twice. But there is also another effect: the distribution for $$p$$ gets taller and narrower around the observed ratio $$k/n$$:
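
This narrowing can be checked without a plot. Holding the observed ratio at $$k/n = 1/4$$ and increasing $$n$$, the standard deviation of $$Beta(k+1, n-k+1)$$ shrinks; a quick python sketch using the closed-form beta variance:

```python
import math

def beta_sd(a, b):
    """Standard deviation of a Beta(a, b) distribution."""
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

# Fix the observed ratio k/n = 1/4 and let n grow.
for n in (4, 20, 100, 500):
    k = n // 4
    print(f"n = {n:3d}: sd = {beta_sd(k + 1, n - k + 1):.4f}")
```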

## Investigation framework

So, which value will we use for $$p$$?

• The smaller the value of $$p$$, the more we have to test to reach a given confidence in the bug solution.
• We must, then, choose the probability of error that we want to tolerate, and take the smallest value of $$p$$ that we can.
• A usual value for the probability of error is 5% (2.5% on each side).
• That means that we take the value of $$p$$ that leaves 2.5% of the area of the density curve out on the left side. Let's call this value $$p_{min}$$.
• That way, if the observed $$k/n$$ remains roughly constant, $$p_{min}$$ will rise, converging to the "real" $$p$$ value.
• As $$p_{min}$$ rises, the amount of testing we have to do after fixing the bug decreases.

By using this framework we have direct, visual and tangible incentives to test more. We can objectively measure the potential contribution of each test.

In order to calculate $$p_{min}$$ with the mentioned properties, we have to solve the following equation:

$\sum_{i=k}^{n}{n\choose{i}}p_{min}^i(1-p_{min})^{n-i}=\frac{\alpha}{2}$

$$\alpha$$ here is twice the error we want to tolerate: 5% for an error of 2.5%.

That's not a trivial equation to solve for $$p_{min}$$ by hand. Fortunately, it is the formula for the confidence interval of the binomial distribution, and there are many sites that can calculate it.
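
Lacking such a site, the equation can also be solved numerically: the left-hand side is monotone in $$p_{min}$$, so a simple bisection works. A python sketch (the function name is ours):

```python
from math import comb

def p_min(k, n, alpha=0.05, tol=1e-9):
    """Solve sum_{i=k}^{n} C(n,i) p^i (1-p)^(n-i) = alpha/2 for p,
    i.e. the lower confidence bound after seeing k failures in n tests."""
    def upper_tail(p):
        return sum(comb(n, i) * p**i * (1 - p)**(n - i)
                   for i in range(k, n + 1))
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if upper_tail(mid) < alpha / 2:
            lo = mid  # tail too small: the root is to the right
        else:
            hi = mid
    return (lo + hi) / 2

# One error in four tests: the equation reduces to
# 1 - (1 - p)^4 = 0.025, so p_min = 1 - 0.975^(1/4), about 0.63%.
print(p_min(1, 4))
```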

# Is the bug fixed?

So, you have tested a lot and calculated $$p_{min}$$. The next step is fixing the bug.

After fixing the bug, you will want to test again, in order to confirm that the bug is fixed. How much testing is enough testing?

Let's say that $$t$$ is the number of times we test the bug after it is fixed. Then, if our fix is not effective and the bug still presents itself with a probability greater than the $$p_{min}$$ that we calculated, the probability of not seeing the bug after $$t$$ tests is:

$\alpha = (1-p_{min})^t$

Here, $$\alpha$$ is also the probability of making a type I error, while $$1 - \alpha$$ is the statistical significance of our tests.

We now have two options:

• arbitrarily determine a standard statistical significance and test enough times to assert it.
• test as much as we can and report the achieved statistical significance.

Both options are valid. The first one is not always feasible, as the cost of each trial can be high in time and/or other kinds of resources.

The standard in the industry is a 5% significance level (95% statistical significance); we recommend that or stricter.

Formally, this is very similar to statistical hypothesis testing.

# Back to the Bug

## Testing 20 times

This file has the results found after running our program 5000 times. We must never throw out data, but let's pretend that we have tested our program only 20 times. The observed $$k/n$$ ratio and the calculated $$p_{min}$$ evolved as shown in the following graph:

After those 20 tests, our $$p_{min}$$ is about 12%.

Suppose that we fix the bug and test it again. The following graph shows the statistical significance corresponding to the number of tests we do:

In words: we have to test 24 times after fixing the bug to reach 95% statistical significance, and 35 to reach 99%.

Now, what happens if we test more before fixing the bug?

## Testing 5000 times

Let's now use all the results and assume that we tested 5000 times before fixing the bug. The graph below shows $$k/n$$ and $$p_{min}$$:

After those 5000 tests, our $$p_{min}$$ is about 23% - much closer to the real $$p$$.

The following graph shows the statistical significance corresponding to the number of tests we do after fixing the bug:

We can see in that graph that after about 11 tests we reach 95%, and after about 16 we get to 99%. As we have tested more before fixing the bug, we found a higher $$p_{min}$$, and that allowed us to test less after fixing the bug.

# Optimal testing

We have seen that we decrease $$t$$ as we increase $$n$$, as that potentially increases our lower estimate for $$p$$. Of course, that estimate can decrease as we test, but that just means that we "got lucky" in the first trials and we are getting to know the bug better - the estimate approaches the real value in a non-deterministic way, after all.

But, how much should we test before fixing the bug? Which value is an ideal value for $$n$$?

To define an optimal value for $$n$$, we will minimize the sum $$n+t$$. This objective gives us the benefit of minimizing the total amount of testing without compromising our guarantees. Minimizing the testing can be fundamental if each test costs significant time and/or resources.

The graph below shows us the evolution of the values of $$t$$ and $$t+n$$ using the data we generated for our bug:

We can see clearly that there are some low values of $$n$$ and $$t$$ that give us the guarantees we need. Those values are $$n = 15$$ and $$t = 24$$, which gives us $$t+n = 39$$.

While you can use this technique to minimize the total number of tests performed (even more so when testing is expensive), testing more is always a good thing: it improves our guarantees, be it in $$n$$ by providing a better estimate of $$p$$, or in $$t$$ by increasing the statistical significance of the conclusion that the bug is fixed. So, before fixing the bug, test until you see the bug at least once, and then at least the amount specified by this technique - but test more if you can, especially after fixing the bug; there is no upper bound. You can then report a higher confidence in the solution.

# Conclusions

When a programmer finds a bug that behaves in a non-deterministic way, they know they should test enough to learn more about the bug, and then even more after fixing it. In this article we have presented a framework that provides criteria to define numerically how much testing is "enough" and "even more." The same technique also provides a method to objectively measure the guarantee provided by the amount of testing performed, when it is not possible to test "enough."

We have also provided a real example (even though the bug itself is artificial) where the framework is applied.

## December 08, 2013

### Removed CRANberries

#### Package pathmox (with last version 0.1-1) was removed from CRAN

Previous versions (as known to CRANberries) which should be available via the Archive link are:

2012-03-15 0.1-1
2010-08-10 0.1

### CRANberries

#### New package sglOptim with initial version 0.0.80

Package: sglOptim
Type: Package
Title: Sparse group lasso generic optimizer
Version: 0.0.80
Date: 2013-7-12
Author: Martin Vincent
Maintainer: Martin Vincent
Description: Fast generic solver for sparse group lasso optimization problems. The loss (objective) function must be defined in a C++ module. This package apply template metaprogramming techniques, therefore -- when compiling the package from source -- a high level of optimization is needed to gain full speed (e.g. for the GCC compiler use -O3). Use of multiple processors for cross validation and subsampling is supported through OpenMP. The Armadillo C++ library is used as the primary linear algebra engine.
URL: http://dx.doi.org/10.1016/j.csda.2013.06.004
Depends: R (>= 2.15.0), Matrix,
Collate: 'lambda_sequence.R' 'prepare_args.R' 'sgl_fit.R' 'sgl_config.R' 'sgl_predict.R' 'sgl_cv.R' 'sgl_subsampling.R'
Packaged: 2013-12-08 17:28:28 UTC; martin
NeedsCompilation: yes
Repository: CRAN
Date/Publication: 2013-12-08 20:33:23

## December 07, 2013

### Dirk Eddelbuettel

#### R and Big Data at Big Data Summit at UI Research Park

I spent yesterday at the very enjoyable Big Data Summit held at the University of Illinois Research Park at the edge of the University of Illinois at Urbana-Champaign campus.

My (short) presentation was part of a panel session on R and Big Data which Doug Simpson of the UIUC Statistics department had put together very well. We heard from a vendor / technology provider with Christopher Nguyen from Adatao talking about their "Big R", from industry with Andy Stevens talking about a number of real-life challenges with big data at John Deere, from academia with Jonathon Greenberg talking about R and HPC for geospatial research, and I added a few short comments and links about R, HPC and Rcpp. My few slides are now up on my talks / presentations page.

Overall, a good day with a number of interesting presentations and of course a number of engaging hallway discussions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

### CRANberries

#### New package mnlogit with initial version 1.0

Package: mnlogit
Type: Package
Title: Multinomial Logit Model
Version: 1.0
Date: 2013-11-11
Suggests: mlogit, nnet
Depends: R (>= 2.10)
Description: Time and memory efficient estimation of multinomial logit models using maximum likelihood method. Numerical optimization performed by Newton-Raphson method using an optimized, parallel C++ library to achieve fast computation of Hessian matrices. Motivated by large scale multiclass classification problems in econometrics and machine learning.
Packaged: 2013-12-07 12:03:32 UTC; ahasan
NeedsCompilation: yes
Repository: CRAN
Date/Publication: 2013-12-07 14:53:31

#### New package sortinghat with initial version 0.1

Package: sortinghat
Title: sortinghat
Version: 0.1
Date: 2013-12-07
Author: John A. Ramey
Maintainer: John A. Ramey
Description: sortinghat is a classification framework to streamline the evaluation of classifiers (classification models and algorithms) and seeks to determine the best classifiers on a variety of simulated and benchmark data sets. Several error-rate estimators are included to evaluate the performance of a classifier. This package is intended to complement the well-known 'caret' package.
Depends: R (>= 3.0.1)
Imports: MASS, bdsmatrix, mvtnorm
Suggests: testthat
LazyData: true
URL: http://github.com/ramhiser/sortinghat
Collate: 'check-arguments.r' 'sortinghat-package.r' 'helper-partition-data.r' 'helper-which-min.r' 'simdata-normal.r' 'simdata-uniform.r' 'covariance-matrices.r' 'helpers.r' 'simdata-t.r' 'simdata-contaminated-normal.r' 'simdata-guo.r' 'simdata-friedman.r' 'simdata-wrapper.r' 'errorest-632.r' 'errorest-632plus.r' 'errorest-apparent.r' 'errorest-bcv.r' 'errorest-boot.r' 'errorest-cv.r' 'errorest-loo-boot.r' 'errorest-wrapper.r'
Packaged: 2013-12-07 08:06:49 UTC; ramhiser
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2013-12-07 10:31:16

## December 06, 2013

### CRANberries

#### New package hflights with initial version 0.1

Package: hflights
Type: Package
Title: Flights that departed Houston in 2011
Version: 0.1
Description: A data only package containing commercial domestic flights that departed Houston (IAH and HOU) in 2011.
Depends: R (>= 2.10)
LazyData: true
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2013-12-07 01:00:06

#### New package elrm with initial version 1.2.2

Package: elrm
Type: Package
Title: Exact Logistic Regression via MCMC
Version: 1.2.2
Date: 2013-12-05
Depends: R(>= 2.7.2), coda, graphics, stats
Author: David Zamar, Jinko Graham, Brad McNeney
Maintainer: David Zamar
Description: elrm implements a Markov Chain Monte Carlo algorithm to approximate exact conditional inference for logistic regression models. Exact conditional inference is based on the distribution of the sufficient statistics for the parameters of interest given the sufficient statistics for the remaining nuisance parameters. Using model formula notation, users specify a logistic model and model terms of interest for exact inference.
URL: http://stat-db.stat.sfu.ca:8080/statgen/research/elrm
Repository: CRAN
Date/Publication: 2013-12-07 00:52:08
Packaged: 2013-12-06 21:00:15 UTC; davidzamar
NeedsCompilation: yes

#### New package bartMachine with initial version 1.0

Package: bartMachine
Type: Package
Title: bartMachine: Bayesian Additive Regression Trees
Version: 1.0
Date: 2013-12-5
Author: Adam Kapelner and Justin Bleich
Description: An advanced implementation of Bayesian Additive Regression Trees with expanded features for data analysis and visualization
Depends: R (>= 2.14.0), rJava (>= 0.8-4), car, randomForest, missForest
SystemRequirements: Java (>= 1.6.27)
Packaged: 2013-12-06 20:22:34 UTC; Kapelner
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2013-12-07 00:58:35

### Revolutions

#### Because it's Friday: ASCII fluid simulator

Here's a neat fluid simulation program that runs entirely in ASCII (via Ryan Grannell):

If you want to try it out yourself you can download the source code for the program here. Just feed it a text file with an ASCII art drawing of the scenario you want to simulate (more details here). What you may have missed is that in the very first simulation of the video, the simulation scenario is the source code of the program itself:

endoh1 < endoh1.c

Not only does the entire source code fit into one 80x25 screen, it's written to appear as the word FLUID in a bowl ... which you see sloshing around when that very program is run. Wow.

Anyway, that's all for this week. See you on Monday!

#### On the growth of R and Python for data science

A recent article by Matt Asay claims that "Python is displacing R as the language for data science". Python has certainly made some great strides in recent years, evolving beyond a data processing tool (an area where Python excels) to a data analysis tool. The Pandas project, in particular, has greatly expanded Python's ability to handle statistical data sets (introducing an object akin to R's data frame), and added some time series handling tools. But Python is still a long, long way from being able to support the range of statistical procedures supported by the core R language, let alone those provided by the 5000 community-contributed packages on CRAN.

Asay's article is heavy on anecdote but light on actual data to support its claim. (ComputerWorld's Sharon Machlis does a great job pointing out the irony there.) Nonetheless, data do exist on R and Python usage; while there's no user-registration data for open-source projects, secondary sources can provide intelligence on how open source projects are being used. RStudio's Hadley Wickham uses data from the developer Q&A site StackOverflow to chart the number of open questions asked per month about R and Python, as a proxy for active usage:

As a general-purpose data processing tool, it's no surprise that Python has more activity than the domain-specific analytics language R. But it's clear that both are growing explosively (Wickham describes the growth as "very close to being exponential"). Looking closer, though, we see that the number of R questions, as a fraction of Python questions, is also growing rapidly:

This belies the claim that Python is displacing R. In fact, this chart suggests the reverse is true, and that R usage is growing at a faster rate than Python.

More data points come from user surveys. In the 2013 KDNuggets poll of top languages for analytics, data mining and data science, R was the most-used software for the third year running (60.9%), with Python in second place (38.8%). More tellingly, R's usage grew almost four times faster than Python's in 2013 versus 2012 (8.4 percentage points for R, compared to 2.7 percentage points for Python).

It's a similar story on the community side. R has more than 125 active user groups worldwide, and the number of user group meetings has increased by 41% in the last year. Python has around 400 user groups (I couldn't find stats on the growth rate), but RedMonk's Stephen O'Grady compares the communities devoted to data science:

At RedMonk, we typically bet on the bigger community, but that’s not as easy here. Python’s total community is obviously much larger, but it seems probable that R’s community, which is more or less strictly focused on data science, is substantially larger than the subset of the Python community specifically focused on data.

My personal take is that there's more than enough room for both Python and R. As the data science boom continues, both will continue to grow as more and more practitioners enter the world of statistical computing. Some (especially those that come from a computer-science background) will choose Python. And those that come from a statistics or data science background will choose R (or will have already learned R in their studies). And even some that come from the die-hard developer community will end up loving R. But both communities will continue to advance the art of data science, and as open-source communities will inevitably cross-pollinate each other. R has already influenced Python in the realm of data analysis, and it would be no bad thing if Python were to influence R in other areas. That, after all, is the beauty of open source software.

### CRANberries

#### New package Rhpc with initial version 0.13-340.18

Package: Rhpc
Version: 0.13-340.18
Date: 2013-12-06
Title: An R package for High-Performance Computing
Author: Junji Nakano and Ei-ji Nakama
Maintainer: Ei-ji Nakama
Depends: R (>= 3.0.0),parallel
SystemRequirements: GNU make, R built as a shared or static library, MPI library.
Description: Rhpc_lapply, Rhpc_lapplyLB and Rhpc_worker_call using MPI provides better HPC environment on R(works fast on HPC). maybe start from the Rhpc batch command is better(faster).
OS_type: unix
URL: http://prs.ism.ac.jp/~nakama/Rhpc/
BugReports: e-mail:nakama@com-one.com
ByteCompile: true
Packaged: 2013-12-06 09:28:04 UTC; nakama
NeedsCompilation: yes
Repository: CRAN
Date/Publication: 2013-12-06 11:29:38

## December 05, 2013

### CRANberries

#### New package assertthat with initial version 0.1

Package: assertthat
Title: Easy pre and post assertions.
Version: 0.1
Description: assertthat is an extension to stopifnot() that makes it easy to declare the pre and post conditions that you code should satisfy, while also producing friendly error messages so that your users know what they've done wrong.
Suggests: testthat
Collate: 'assert-that.r' 'on-failure.r' 'assertions-file.r' 'assertions-scalar.R' 'assertions.r' 'base.r' 'base-comparison.r' 'base-is.r' 'base-logical.r' 'base-misc.r' 'utils.r' 'validate-that.R'
Roxygen: list(wrap = FALSE)
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2013-12-06 00:51:10

### Revolutions

#### A conversation with Robert Scoble

I had the pleasure of visiting technology guru Robert Scoble at Rackspace Labs a few weeks ago. We had a great conversation about the Big Data revolution, and how large enterprises are catching up with the likes of Google and Facebook when it comes to using their data stores to improve their operations and provide a personalized experience for their customers. The key to making this possible is data science using the R language, and the speed, scalability and enterprise readiness provided by Revolution R Enterprise.

### CRANberries

#### New package Sample.Size with initial version 1.0

Package: Sample.Size
Type: Package
Title: Sample size calculation
Version: 1.0
Date: 2013-12-03
Author: Wei Jiang, Jonathan Mahnken, Matthew Mayo
Maintainer: Wei Jiang
Description: Computes the required sample size using the optimal designs with multiple constraints proposed in Mayo et al.(2010). This optimal method is designed for two-arm, randomized phase II clinical trials, and the required sample size can be optimized either using fixed or flexible randomization allocation ratios.
Packaged: 2013-12-04 17:18:13 UTC; rraghavan
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2013-12-05 18:05:23

#### New package drsmooth with initial version 1.0

Package: drsmooth
Title: Dose-Response Modeling with Smoothing Splines
Description: Drsmooth provides tools for assessing the shape of a dose-response curve by testing linearity and non-linearity at user-defined cut-offs. It also provides two methods of estimating a threshold dose, or the dose at which the dose-response function transitions to significantly increasing: bi-linear (based on pkg:segmented) and smoothed with splines (based on pkg:mgcv).
Version: 1.0
Authors@R: c(person("Greg", "Hixon", role = c("aut", "cph"), email = "ghixon@toxstrategies.com"), person("Anne", "Bichteler", role = c("aut","cre"), email = "abichteler@toxstrategies.com"), person("Chad", "Thompson", role = "ctb", email = "cthompson@toxstrategies.com"), person("Liz", "Abraham", role = "ctb", email = "labraham@toxstrategies.com"))
Maintainer: Anne Bichteler
Depends: R (>= 3.0.1)
Imports: car, clinfun, mgcv, multcomp, pgirmess, DTK, segmented, mvtnorm
Suggests: testthat
LazyData: TRUE
Collate: 'bartlett.r' 'chisquare.r' 'dosefactor.r' 'DRdata.R' 'dunnetts1.r' 'dunnetts2.r' 'dunnettst3.r' 'dunns1.r' 'dunns2.r' 'jonckheere.r' 'lbcd.r' 'nlaad.r' 'nlbcd.r' 'outlier.r' 'pkg_prep.r' 'prelimstats.r' 'segment.r' 'shapiro.r' 'firstDeriv.r' 'noel.r' 'dunnetts.format.r' 'dunns.format.r' 'drsmooth.print.r' 'segment.plot.r' 'segment.print.r' 'spline.plot.r' 'drsmooth-package.R' 'smooth.r'
Packaged: 2013-12-04 19:50:26 UTC; abichteler
Author: Greg Hixon [aut, cph], Anne Bichteler [aut, cre], Chad Thompson [ctb], Liz Abraham [ctb]
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2013-12-05 17:26:56

### Removed CRANberries

#### Package sprint (with last version 1.0.4) was removed from CRAN

Previous versions (as known to CRANberries) which should be available via the Archive link are:

2012-12-13 1.0.4
2012-10-24 1.0.3

### CRANberries

#### New package Rankcluster with initial version 0.91.5

Package: Rankcluster
Type: Package
Title: Model-based clustering for multivariate partial ranking data
Version: 0.91.5
Date: 2013-12-03
Author: Quentin Grimonprez
Maintainer: Quentin Grimonprez
Description: This package proposes a model-based clustering algorithm for ranking data. Multivariate rankings as well as partial rankings are taken into account. This algorithm is based on an extension of the Insertion Sorting Rank (ISR) model for ranking data, which is a meaningful and effective model parametrized by a position parameter (the modal ranking, quoted by mu) and a dispersion parameter (quoted by pi). The heterogeneity of the rank population is modelled by a mixture of ISR, whereas conditional independence assumption is considered for multivariate rankings.
Copyright: Gael Guennebaud, Benoit Jacob are the authors of Eigen (http://eigen.tuxfamily.org) under MPL-2 (See inst/COPYRIGHTS file for more details).
Depends: Rcpp (>= 0.9.14), methods
Collate: 'conversion.R' 'likelihood.R' 'mixtureSEM.R' 'rankclust.R' 'RankDistance.R' 'RankFunctions.R' 'resultClass.R' 'test.R'
Packaged: 2013-12-05 08:24:23 UTC; ripley
NeedsCompilation: yes
Repository: CRAN
Date/Publication: 2013-12-05 09:26:14

### Removed CRANberries

#### Package gearman (with last version 0.1-5) was removed from CRAN

Previous versions (as known to CRANberries) which should be available via the Archive link are:

2013-07-05 0.1-5

#### Package tmg (with last version 0.1) was removed from CRAN

Previous versions (as known to CRANberries) which should be available via the Archive link are:

2012-08-23 0.1

#### Package GUTS (with last version 0.2.8) was removed from CRAN

Previous versions (as known to CRANberries) which should be available via the Archive link are:

2012-07-12 0.2.8
2011-06-22 0.1.45
2011-06-17 0.1

### Revolutions

#### 2013 was a good year for R User Group Meetings

by Joseph Rickert

R user groups are thriving. By our count of events listed on Revolution Analytics' Community Calendar, there were about 441 face-to-face R meetups worldwide in 2013; up 41% from 312 in 2012. The plot below indicates what the activity looks like on a monthly basis.

My take is that user group meetings are doing well because they fulfill basic social and educational needs. I have come to believe that people attend R user group meetings for four main reasons:

1. to enjoy themselves socializing with people who have a common and maybe a bit esoteric interest
2. to learn something new about R that they might not otherwise encounter in their everyday work environment
3. to talk about technical ideas in statistics, machine learning and programming that they find exciting
4. to feel like they are connected in the R world

This last reason illustrates a curious aspect of in-person meetings: no matter how much effort you may put into being connected with email, blogs and social media, in a room full of people who share a common interest you are bound to learn something new.

Looking forward to 2014, what can we do to make next year’s meetups even better? If you are an organizer here are a few ideas that you may find helpful:

• Apply for a Revolution Analytics 2014 user group grant. Even a modest vector level grant can help with building cohesion for a new group.
• Schedule your meetings at regular intervals. The Bay Area User Group (BARUG) tries to meet regularly on the 2nd Tuesday of the month. We are not rigorous about this but keeping pretty close to this schedule helps members keep the next meeting date "in the back of their minds".
• If you don’t already, try varying the format of the presentations. At BARUG we tend to have two formats: 45-minute talks and 12-to-15-minute lightning talks. Most meetings have one long talk and one or two lightning talks. However, the most popular format seems to be an evening of several lightning talks. This kind of event can be time consuming to organize, and difficult to run smoothly, but members always respond positively.
• Go for as much variety in the content of the presentations as possible. It is probably the case that everyone who works with R on a regular basis, even very knowledgeable people, tends to use the same libraries and functions. It is delightful to learn something completely new or to see the familiar in some new setting.
• Vary the level of sophistication of the presentations. R Rock Stars do draw big crowds, however some of the most widely enjoyed meetings have been essentially tutorials. Meetings geared to beginners are essential for growth.
• If you would like to have a speaker from Revolution Analytics please write to us at community@revolutionanalytics.com. We can’t make any promises, but if you are close to places where we have offices in the US, Singapore or the UK there is a good chance that we can make some arrangements with enough notice.

If you enjoy attending R User Group meetings then, please consider giving a talk at your local R user group in 2014. Even if you are an absolute R beginner, what better professionally related New Year’s resolution could you aspire to than working towards a short R talk? My guess is that you will find your fellow group members to be very supportive.

## December 04, 2013

### CRANberries

#### New package obs.agree with initial version 1.0

Package: obs.agree
Type: Package
Title: An R package to assess agreement between observers.
Version: 1.0
Date: 2012-09-25
Author: Teresa Henriques, Luis Antunes and Cristina Costa-Santos
Maintainer: Teresa Henriques
Description: The package includes two functions for measuring agreement. Raw Agreement Indices (RAI) to categorical data and Information-Based Measure of Disagreement (IBMD) to continuous data. It can be used for multiple raters and multiple readings cases.
URL: http://www.r-project.org, http://disagreement.med.up.pt/
Packaged: 2013-12-04 15:13:40 UTC; teresahenriques
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2013-12-04 16:28:27

### Removed CRANberries

#### Package Funclustering (with last version 1.0) was removed from CRAN

Previous versions (as known to CRANberries) which should be available via the Archive link are:

2013-08-23 1.0

#### Package Rankcluster (with last version 0.91) was removed from CRAN

Previous versions (as known to CRANberries) which should be available via the Archive link are:

2013-09-27 0.91
2013-09-12 0.90.3
2013-09-05 0.90.2
2013-08-30 0.90.1
2013-08-28 0.89

#### Package blockcluster (with last version 3.0) was removed from CRAN

Previous versions (as known to CRANberries) which should be available via the Archive link are:

2013-08-27 3.0
2013-04-12 2.0.2

#### Package clere (with last version 1.0) was removed from CRAN

Previous versions (as known to CRANberries) which should be available via the Archive link are:

2013-12-03 1.0

### CRANberries

#### New package RefFreeEWAS with initial version 1.01

Package: RefFreeEWAS
Type: Package
Title: EWAS using reference-Free DNA methylation mixture deconvolution
Version: 1.01
Date: 2013-11-26
Author: E. Andres Houseman, Sc.D.
Maintainer: E. Andres Houseman
Depends: R (>= 2.3.12), isva
Description: Reference-free method for conducting EWAS while deconvoluting DNA methylation arising as mixtures of cell types. This method is similar to surrogate variable analysis (SVA and ISVA), except that it makes additional use of a biological mixture assumption.
Packaged: 2013-12-04 00:09:48 UTC; housemae
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2013-12-04 07:34:32

#### New package goft with initial version 1.0

Package: goft
Type: Package
Title: Tests of fit for some probability distributions
Version: 1.0
Date: 2013-10-18
Author: Elizabeth Gonzalez-Estrada, Jose A. Villasenor-Alva
Description: This package implements some tests of fit for the normal, Gumbel (type I extreme value distribution), multivariate normal and generalized Pareto distributions.
Depends: stats, gPdtest, mvShapiroTest
Packaged: 2013-12-03 21:34:44 UTC; elizabeth
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2013-12-04 07:29:16

#### New package babel with initial version 0.2-3

Package: babel
Version: 0.2-3
Title: Ribosome profiling data analysis
Author: Adam B. Olshen, Richard A. Olshen, Barry S. Taylor
Depends: R (>= 2.14.0), edgeR
Imports: parallel
Suggests: R.rsp, R.devices, R.utils
VignetteBuilder: R.rsp
Description: Implements babel routines for identifying unusual ribosome protected fragment counts given mRNA counts
Repository: CRAN
Repository/R-Forge/Project: abo
Repository/R-Forge/Revision: 528
Repository/R-Forge/DateTimeStamp: 2013-12-03 19:48:17
Date/Publication: 2013-12-04 07:41:13
Packaged: 2013-12-03 23:24:37 UTC; rforge
NeedsCompilation: no

## December 03, 2013

### CRANberries

#### New package clere with initial version 1.0

Package: clere
Type: Package
Title: CLERE methodology for simultaneous variables clustering and regression.
Version: 1.0
Date: 2013-11-21
Authors@R: c(person("Loic", "Yengo", role = c("aut", "cre"), email = "loic.yengo@gmail.com"), person("Mickael", "Canouil", role = "ctb", email = "mickael.canouil@good.ibl.fr"))
Description: This package implements the CLERE methodology for simultaneous variables clustering and regression.
Depends: Rcpp (>= 0.10.3), methods
Suggests: grid, ggplot2
Collate: 'fit.clere.R' 'Clere.R' 'sClere-Class.R'
Packaged: 2013-12-03 18:20:18 UTC; root
Author: Loic Yengo [aut, cre], Mickael Canouil [ctb]
Maintainer: Loic Yengo
NeedsCompilation: yes
Repository: CRAN
Date/Publication: 2013-12-03 20:12:20

#### New package seasonal with initial version 0.20.2

Package: seasonal
SystemRequirements: Binary executables of X-13ARIMA-SEATS (installation details included)
Type: Package
Title: R interface to X-13ARIMA-SEATS
Version: 0.20.2
Date: 2013-12-02
Author: Christoph Sax
Maintainer: Christoph Sax
Description: seasonal is an easy-to-use R-interface to X-13ARIMA-SEATS, a seasonal adjustment software developed by the United States Census Bureau. X-13ARIMA-SEATS combines and extends the capabilities of the older X-12ARIMA (developed by the Census Bureau) and the TRAMO-SEATS (developed by the Bank of Spain) software packages. For installation details, see the vignette.
Imports: stringr
Enhances: manipulate