B Coded subroutines
This appendix is composed of several small vignettes for code that is not explained in detail elsewhere in the book, but may be downloaded from the book webpage. Each does something that I find helpful – or that I think you will – in the course of explaining material or working on homework problems. But I felt it would be a distraction to cover these in much detail in the main body of a chapter.
B.1 Discrete \(p\)-values
In §1.4 I used a subroutine for \(p\)-value calculation and
visualization with a discrete distribution, specifically the binomial. I
thought you might find it handy in your monty.* “library” functions for your
homework when working with discrete distributions. It was designed to be
somewhat generic, not only for the binomial. One challenge in any \(p\)-value
calculation is knowing what tail of the sampling distribution your observed
test statistic is in. This determines the reflection point for visualization, and
which sum-based (area under the curve) calculations tally the tail probabilities.
This is what is automated by pval.discrete as provided by
pval_discrete.R on the book
webpage.
If you look in that file, you’ll find the code below, which is pasted here as a
backup, in case you have a printed copy of the book without access to the book
webpage. The file has an additional comment block at the top which explains
how to use the function, and what its arguments are. I’ll re-verbalize those
here. The first argument, s, is the observed statistic, a scalar. The
second, x, represents the domain of the discrete mass, and xd is
its mass (as from xd <- dbinom(x, ...)). Finally, the flag vis lets
the user specify whether an optional visual should be provided.
pval.discrete <- function(s, x, xd, vis=FALSE)
{
  ## some checks
  lx <- length(x)
  if(lx != length(xd)) stop("x, xd length mismatch")
  if(any(xd < 0)) stop("density must be nonnegative")
  if(any(diff(x) <= 0)) stop("x must be increasing")
  ## normalize
  eps <- 2/sum(xd)
  xd <- xd/sum(xd)
  ## calculate mode using "lower reflected density" rule
  mode <- which.max(xd)
  if(s <= x[mode]) { ## stat is in left tail
    il <- x <= s ## indices of left tail
    ps <- sum(xd[il]) ## left tail pval
    d <- max(xd[il]) + eps ## max dens for right tail
    ri <- min(which(x >= x[mode] & xd <= d))
    r <- x[ri] ## reflected stat value
    pr <- sum(xd[ri:lx]) ## right tail pval
  } else { ## stat is in the right tail
    ir <- x >= s ## indices of right tail
    ps <- sum(xd[ir]) ## pval in right tail
    d <- max(xd[ir]) + eps ## max dens for left tail
    ri <- max(which(x <= x[mode] & xd <= d))
    r <- x[ri] ## reflected stat value
    pr <- sum(xd[1:ri]) ## left tail pval
  }
  ## pval from both tails
  p <- ps + pr
  ## possibly plot
  if(vis) {
    ## plot empirical sampling density
    sfun <- stepfun(x[-1], xd)
    ran <- range(c(s, r, x[xd > 0]))
    plot(sfun, main="", xlim=ran, xlab="test stat",
      ylab="sampling distribution (mass)", lwd=2)
    ## show test stat and reflection
    abline(v=c(s, r), lty=1:2, col=2, lwd=2)
    ## put legend
    legend("bottom", c("s", "reflect"), col=2, lty=1:2, lwd=2, bty="n")
  }
  ## return p-value
  return(p)
}
The function output is a simple \(p\)-value estimate.
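As a quick illustration of what the subroutine tallies, consider a symmetric binomial null, where the reflection point is simply the mirror image of the observed statistic. The snippet below (base R only; the null and observed statistic are made up for illustration) reproduces the two-tailed calculation by hand; pval.discrete automates the harder, asymmetric case.

```r
## by-hand two-tailed p-value for a symmetric Binomial(10, 0.5) null
x <- 0:10
xd <- dbinom(x, 10, 0.5)
s <- 2                  ## observed statistic, in the left tail
ps <- sum(xd[x <= s])   ## left-tail probability
r <- 10 - s             ## reflected statistic (mirror image, by symmetry)
pr <- sum(xd[x >= r])   ## right-tail probability
p <- ps + pr            ## two-tailed p-value
```

In this symmetric special case the result collapses to 2*pbinom(s, 10, 0.5); the reflection machinery in pval.discrete earns its keep when the mass is skewed.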
One thing to notice about this subroutine, and others provided in this appendix, is the liberal use of checks and explanations in comments. These are easy to overlook in a first coding pass. But anyone with experience can tell you that they’re saviors in the long run. It’s hard to get in a coding headspace, and remember everything you were thinking when you first wrote some code. Reusing code involves preventing misuse (those sanity checks at the top of the function), and comments that explain each code block (sometimes each line). The most important user of your code is your future self. Be kind to your future self.
B.2 Random ties in ranks
In §9.2 we entertained the possibility of tied ranks in MC
simulations, say for Wilcoxon rank sum tests (§9.2) and
signed ranks tests (§9.3). Similar situations came up
again in Chapter 12. These are not easy to deal with in subset sum
problem (SSP)-based closed-form calculations, but are rather easier
stochastically. The function rank.ties in
rank_ties.R on the book webpage
allows you to randomly generate rank-averaged ties in a permutation of
integers.
The first argument is ranks, which is either 1:n for an \(n\)-sized
sample, or the output of out$ranks in a daisy chain. Outputs are discussed
in more detail below. The second and third arguments specify the nature
of ties requested, via how many (ntie) and at what degree (degree). The final
argument uniq specifies which of the ranks remain unaveraged (untied)
and are available for forming a tie of the desired degree.
rank.ties <- function(ranks, ntie, degree, uniq=ranks)
{
  ## check for degree >= 2
  if(degree < 2) stop("must have degree >= 2")
  ## check for enough left to make ties, avoiding infinite loop below
  if(length(uniq) < degree*ntie + 1)
    stop("too few uniq to create ties")
  ## check non-decreasing ranks
  n <- length(ranks)
  if(any(ranks[-1] < ranks[-n]))
    stop("ranks must be non-decreasing")
  ## check uniq is subset of ranks
  if(any(!(uniq %in% ranks)))
    stop("uniq should be subset of ranks")
  ## inits
  degree <- degree - 1 ## when counting ties, the first one is free
  fail <- 0 ## avoid infinite loop
  ## find each tie
  while(ntie > 0) {
    ## pick one remaining rank at random
    u <- sample(uniq, 1)
    useq <- u:(u + degree)
    ## check that its next tie neighbors are there too
    if(all(useq %in% uniq)) {
      ranks[useq] <- mean(ranks[useq]) ## replace ranks with average
      uniq <- setdiff(uniq, useq) ## remove from the uniq list
      ntie <- ntie - 1 ## one less tie to find
      fail <- 0
    } else { ## if not, try again
      fail <- fail + 1 ## infinite loop safeguard
      if(fail > 10*length(ranks))
        stop("unable to find ties")
    }
  }
  ## return all ranks and which ones are not tied
  return(list(ranks=ranks, uniq=uniq))
}
The output is a list object with ranks modified to include any desired
ties, at random, and the remaining list of untied (out$uniq) ranks. These
can be fed in as inputs to another call, in a daisy chain, in order to have
(multiple) ties of multiple degrees. There are many checks in place in this
subroutine to ensure that the ties requested are reasonable given the other
arguments. They are not fool-proof, but cover many situations I
encountered during development. Although ties of varying degree can be
requested in any order, I recommend starting with higher-degree ties first
since those are harder to find. Doing low-degree ties first could make it
impossible to find runs of unaveraged ranks of higher degree.
The location of the ties is chosen at random, but out$ranks are still in the
order provided by the input ranks argument. When used in an MC based on
ranks, one would still need to randomly permute them, just as in a loop with
no ties.
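To fix ideas, here is the core operation by hand: a single degree-3 tie averaged in place, followed by the random permutation one would apply inside an MC loop. This is only a base R sketch; rank.ties chooses the tie location at random, whereas positions 4:6 are hard-coded below for illustration.

```r
## one degree-3 tie, averaged by hand at positions 4:6;
## note that rank-averaging preserves the total sum of the ranks
ranks <- 1:10
useq <- 4:6
ranks[useq] <- mean(ranks[useq]) ## 4, 5, 6 all become 5
uniq <- setdiff(1:10, useq)      ## ranks still available for further ties
## inside an MC loop, the (tied) ranks would then be randomly permuted
perm <- sample(ranks)
```

Because averaging preserves the sum, test statistics built from rank sums remain on the same scale whether or not ties are present.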
B.3 Subset sum distribution
The Wilcoxon rank sum (§9.2), signed ranks (§9.3), squared ranks (§11.1) and Kruskal-Wallis (KW; §11.2) tests all work with the SSP as a subroutine. Here, I provide details for rank sum and squared ranks tests, leaving signed ranks and KW-testing details to you. First, below is some generic discussion on SSP solving.
SSP Solving
I know of two packages on CRAN that can help with SSP solving. One is FLSSS
(Liu 2025), which is what I used for a long time. While I was busy writing this
book, it seemed that it had been abandoned by its maintainer. It was still on
the archive part of CRAN (the A in the acronym), but it could no longer be
installed with install.packages. Although you could still install the
package from source, I scrambled to find an alternative because I didn’t want
to burden you with that. I found RcppAlgos (Wood 2025), which provides a
similar set of solvers. Since then, FLSSS came back on the main part of
CRAN, with many improvements. I should have been more patient.
The functions in ranks.R on the
book webpage provide an implementation using SSP solvers from both packages,
although the ones from FLSSS are commented out in some cases. At the time
of writing, I had observed that RcppAlgos was faster than I remembered
FLSSS being, because the former used compiled subroutines via Rcpp
(Eddelbuettel and Balamuta 2018). But then, after FLSSS came back from hiatus, I saw that it had
undergone a major upgrade to use a similar kit, including RcppParallel
(J. Allaire et al. 2025). It’s an arms race! For specific statistical testing
problems, like via wilcox.test built into R, there are even faster solvers,
e.g., via pwilcox. These do not deploy general-purpose SSP subroutines,
so they are less easily extendable to other, similar situations (like squared
ranks).
Below, I’ll take you through each subroutine in ranks.R in turn, with brief commentary. These begin with rank sum-specific routines, and then pivot to squared ranks.
Wilcoxon rank sum
First is the density function. Technically this is a mass function, since the
distribution over ranks is discrete, but in R all mass and density functions
start as d*. The k argument is the observed value of the statistic, like
\(\bar{r}\) for rank sum or \(\bar{r}^+\) for signed ranks. Arguments nx and
ny specify the size of the two samples. In this book I have typically
used/labeled those the other way around, first ny and then nx. But I
wrote these before making that choice and decided if it ain’t broke
don’t fix it. You may always call the function with them specified in the
other order, first providing ny and then nx via dranks(k, ny, nx). It
will work fine. Make sure you’re consistent. If your \(\bar{r}\) is from the
sum for the \(y\) group, as in §9.2, then be sure to provide
ny first then nx.
dranks <- function(k, nx, ny)
{
  N <- nx + ny
  ## numer <- length(FLSSS(nx, 1:N, k, 0.1, solutionNeed=sum(1:N)))
  numer <- partitionsCount(1:N, nx, target=k)
  denom <- choose(N, nx)
  return(as.numeric(numer/denom))
}
Here are a few other details for dranks. Notice that the combined sample
size, ny + nx, is called N, not n. I wrote this before considering MC
solutions involving \(N\). The choice of symbol doesn’t affect
anything since it’s a variable in the
namespace local to the dranks
function scope. The partitionsCount function is what solves the SSP. Read
that as finding a subset of size nx using integers 1:N targeting a sum of
k. That function counts how many ways that can be done, and then this
number is normalized by \({N \choose n_x}\) to obtain the mass. Alternatively,
FLSSS may be used by swapping comments. This one has a clunkier syntax.
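If you’d like to sanity-check the SSP count without installing either package, a brute-force base R stand-in (suitable only for small N, and written here purely for illustration) can enumerate the subsets directly with combn. The resulting mass agrees with R’s built-in dwilcox after shifting by the minimum possible rank sum, \(n_x(n_x+1)/2\).

```r
## brute-force SSP count: size-nx subsets of 1:N summing to k
ssp.count <- function(k, nx, ny)
{
  sums <- colSums(combn(1:(nx + ny), nx))
  return(sum(sums == k))
}
## mass at k, normalized as in dranks
nx <- 4; ny <- 5; k <- 12
mass <- ssp.count(k, nx, ny)/choose(nx + ny, nx)
## compare with dwilcox(k - nx*(nx + 1)/2, nx, ny)
```

For these choices, ssp.count finds the two subsets {1,2,3,6} and {1,2,4,5}, so the mass is 2/126, matching dwilcox on R’s shifted Mann–Whitney scale.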
Next, pranks evaluates the CDF, and is implemented simply as a sum of
dranks from the left. There’s not that much else to it.
pranks <- function(k, nx, ny)
{
  ## nothing to do?
  if(k == 0) return(0)
  ## sum up all density evaluations from 1:k
  p <- 0
  for(i in 1:k) p <- p + dranks(i, nx, ny)
  ## done
  return(p)
}
Finally, the code below provides a version of wilcox.test that uses pranks
rather than pwilcox. It doesn’t cover all cases. I really just wrote it to
check that I was getting the same results as wilcox.test. In particular, it
doesn’t allow you to do a signed ranks test, though it could be modified to do
so. (See homework exercises in §9.5.) I decided to
name it after Mann and Whitney to share the love, since they came up with a
similar method around the time of Wilcoxon, but they generally
don’t get as much name recognition.
mannwhitney.test <- function(x, y,
  alternative=c("two-tailed", "less", "greater"))
{
  ## check alternative argument
  alternative <- match.arg(alternative)
  ## extract lengths
  nx <- length(x)
  ny <- length(y)
  N <- nx + ny
  ## get ranks
  r <- rank(c(x, y))
  rx <- r[seq_along(x)]
  ## check for ties and possibly warn
  if(length(r) - length(unique(r)) > 0.1*N)
    warning("more than 10% ties, suggest using normal approximation")
  ## calculate test statistic
  t <- sum(rx)
  ## go through alternatives for p-value
  if(alternative == "less") phi <- pranks(t, nx, ny)
  else if(alternative == "greater") phi <- 1 - pranks(t - 1, nx, ny)
  else { ## two-sided
    phi <- 2*min(pranks(t, nx, ny), 1 - pranks(t - 1, nx, ny))
  }
  ## return results
  return(list(stat=t, p.value=phi, alternative=alternative))
}
As above, it doesn’t matter if you flip x and y arguments.
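For instance, on small tie-free samples (made up below for illustration), the exact lower-tail p-value computed from the rank-sum statistic matches wilcox.test once the statistic is shifted onto R’s Mann–Whitney scale. This sketch substitutes pwilcox for pranks so that it runs without the SSP packages.

```r
## made-up, tie-free samples
x <- c(1.2, 3.4, 0.7, 2.2)
y <- c(2.9, 4.1, 3.8, 5.0, 4.4)
nx <- length(x); ny <- length(y)
t <- sum(rank(c(x, y))[1:nx]) ## rank-sum statistic, as in the wrapper
W <- t - nx*(nx + 1)/2        ## shift to R's Mann-Whitney scale
phi <- pwilcox(W, nx, ny)     ## exact lower-tail p-value
```

With no ties and small samples, wilcox.test(x, y, alternative="less") defaults to the same exact calculation, so the two p-values agree.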
Squared ranks
This test is due to Conover (1999), and my implementation is based on
the verbal description provided therein. First is the density function. When it
is called repeatedly, for example when plotting as in
Figure 11.1, it helps to normalize the density externally. That’s what’s facilitated
by dsqranks.denom below.
dsqranks.denom <- function(nx, ny)
{
  N <- nx + ny
  r2 <- (1:N)^2
  sr2 <- sum(r2)
  denom <- 0
  for(i in 1:sr2) { ## expensive, could be faster if approximate
    ## solve subset sum problem
    ## denom <- denom + length(FLSSS(nx, r2, i, 0.1, solutionNeed=sr2))
    denom <- denom + nrow(partitionsGeneral(r2, nx, target=i))
  }
  return(denom)
}
Notice how it loops over every possible squared-rank sum, from 1 to the sum
of the first ny + nx integers squared. A calculation based on FLSSS is commented out, as with
earlier functions for Wilcoxon tests. The active code is based on a routine
from RcppAlgos, but not the same routine as for Wilcoxon tests. For some
reason, partitionsCount does not work in this context. It spits out a
warning which is a little cryptic, but I deduced that it would rather you use
partitionsGeneral to enumerate all solutions. Actual solutions aren’t
required by this application, only their count.
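One way to convince yourself the loop is doing the right thing: summed over every target i, the counts must account for each size-nx subset exactly once, so the total must come out to \({N \choose n_x}\). The base R check below (no packages required, small N only) uses a brute-force combn stand-in for the SSP solver.

```r
## brute-force squared-ranks SSP count for small N
sqssp.count <- function(k, nx, ny)
{
  r2 <- (1:(nx + ny))^2
  return(sum(colSums(combn(r2, nx)) == k))
}
## summing counts over all possible targets recovers choose(N, nx)
nx <- 3; ny <- 3
denom <- sum(sapply(1:sum((1:(nx + ny))^2),
  function(i) sqssp.count(i, nx, ny)))
stopifnot(denom == choose(nx + ny, nx)) ## sanity check
```

This also explains why the normalizer is expensive to compute by looping, even though its value is known in closed form.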
The dsqranks function, calculating the mass, can take a pre-calculated
normalization, or it can calculate its own normalization (default). Notice
how its numerator is a single term of the sum computed in dsqranks.denom. Normalization is
where most of the work is involved, computationally speaking.
dsqranks <- function(k, nx, ny, denom=NULL)
{
  N <- nx + ny
  r2 <- (1:N)^2
  ## solve subset sum problem
  ## numer <- length(FLSSS(nx, r2, k, 0.1, solutionNeed=sum(r2)))
  numer <- nrow(partitionsGeneral(r2, nx, target=k))
  ## offload to another function in case pre-calculated
  if(is.null(denom)) denom <- dsqranks.denom(nx, ny)
  return(as.numeric(numer/denom))
}
Calculating mass isn’t useful directly for testing, but can help with
visualization. Testing requires summing over tail probabilities to
calculate \(p\)-values, as provided by psqranks. This is the same as
evaluating the mass over all tail events. Although that could involve calls
to dsqranks, I found it more expedient to extract out the important aspects
of dsqranks and its normalization.
psqranks <- function(k, nx, ny)
{
  ## nothing to do?
  if(k == 0) return(0)
  ## avoiding duplicated denom computation in multiple dsqranks calls
  N <- nx + ny
  r2 <- (1:N)^2
  sr2 <- sum(r2)
  ## calculate the common denominator to all density calculations
  denom <- dsqranks.denom(nx, ny)
  ## sum up all density evaluations from 1:k
  p <- 0
  for(i in 1:k) {
    ## p <- p + length(FLSSS(nx, r2, i, 0.1, solutionNeed=sr2))/denom
    p <- p + as.numeric(nrow(partitionsGeneral(r2, nx, target=i))/denom)
  }
  ## done
  return(p)
}
To check that everything was working, I built a squared ranks test wrapper, which is provided below.
sqranks.test <- function(x, y,
  alternative=c("two-tailed", "less", "greater"))
{
  ## check alternative argument
  alternative <- match.arg(alternative)
  ## data dimensions
  nx <- length(x)
  ny <- length(y)
  N <- nx + ny
  ## calculate absolute discrepancies from the (estimated) mean
  xbar <- mean(x)
  ybar <- mean(y)
  u <- abs(x - xbar)
  v <- abs(y - ybar)
  ## calculate the observed ranks, and t
  r <- rank(c(u, v))
  ru <- r[seq_along(u)]
  rv <- r[-seq_along(u)]
  ## check for ties and possibly warn
  if(length(r) - length(unique(r)) > 0.1*N)
    warning("more than 10% ties, suggest using normal approximation")
  ## calculate test statistic
  t <- sum(ru^2)
  ## go through alternatives for p-value
  if(alternative == "less") phi <- psqranks(t, nx, ny)
  else if(alternative == "greater") phi <- 1 - psqranks(t-1, nx, ny)
  else { ## two-sided
    phi <- 2*min(psqranks(t, nx, ny), 1 - psqranks(t-1, nx, ny))
  }
  ## return results
  return(list(stat=t, p.value=phi, alternative=alternative))
}
The output of this function matches what conover(..., do.exact=TRUE)
provides, using a library function from ANSM5 (Spencer 2024) on CRAN. Like the
shoutout to pwilcox and wilcox.test above, conover is faster than
sqranks.test because its implementation uses purpose-built (rather than
general purpose) subroutines to enumerate sums of ranks. To be honest, I
couldn’t figure out what was going on under the hood in conover. One upside
to sqranks.test is that it’s super straightforward and easy to adapt to
other similar tests.
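As one final check that runs without any packages, the exact upper-tail p-value for tiny samples can be obtained by complete enumeration with combn, mirroring the alternative="greater" branch of the wrapper. The samples below are made up (and chosen to be tie-free) for illustration.

```r
## tiny, made-up, tie-free samples
x <- c(2.1, 5.3, 3.3)
y <- c(4.0, 4.3, 4.1)
u <- abs(x - mean(x)); v <- abs(y - mean(y))
r <- rank(c(u, v))
t <- sum(r[1:3]^2)                 ## observed squared-ranks statistic
sums <- colSums(combn((1:6)^2, 3)) ## statistic under every assignment
phi <- mean(sums >= t)             ## exact upper-tail p-value
```

Here the x sample is much more spread out than y, so its absolute discrepancies earn the largest ranks and the observed statistic lands in the extreme upper tail.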
References
Allaire, J., et al. 2025. RcppParallel: Parallel Programming Tools for Rcpp. https://doi.org/10.32614/CRAN.package.RcppParallel.
Conover, W. J. 1999. Practical Nonparametric Statistics. 3rd ed. New York: Wiley.
Eddelbuettel, D., and J. J. Balamuta. 2018. “Extending R with C++: A Brief Introduction to Rcpp.” The American Statistician 72 (1): 28–36. https://doi.org/10.1080/00031305.2017.1375990.
Liu. 2025. FLSSS: Mining Rigs for Problems in the Subset Sum Family. https://doi.org/10.32614/CRAN.package.FLSSS.
Spencer. 2024. ANSM5: Functions and Data for the Book “Applied Nonparametric Statistical Methods,” 5th Edition. https://doi.org/10.32614/CRAN.package.ANSM5.
Wood, J. 2025. RcppAlgos: High Performance Tools for Combinatorics and Computational Mathematics. https://doi.org/10.32614/CRAN.package.RcppAlgos.