The use of longitudinal data sets leads to extremely high-dimensional multivariate regression models, because each individual predictor (i.e., covariate) measured over time and/or space must be treated as a large number of predictor variables in the regression model. Automatically discovering unknown group structure from the data is fundamentally important, since the group structure leads to simple, parsimonious, and interpretable models. This vignette describes the usage of longitudinal coefficient tree regression (LCTR) in R. See the Surer et al. (2022) reference paper listed below for details.
We consider the linear regression model (the general version, with ungrouped coefficients) with \(n\) sampling units and \(p\) predictors, each of which is measured at the same \(L\) time points: \[\begin{align} \begin{split} \mathbf{y} = &\beta_{1,1}\mathbf{x}_{1,1} + \cdots + \beta_{L,1}\mathbf{x}_{L,1} + \\ & \beta_{1,2} \mathbf{x}_{1,2} + \cdots + \beta_{L,2}\mathbf{x}_{L,2} + \cdots + \\ & \beta_{1,p} \mathbf{x}_{1,p} + \cdots + \beta_{L,p}\mathbf{x}_{L,p} + \boldsymbol{\epsilon}, \end{split} \end{align}\] where \(\mathbf{y} = \big[y_1, y_{2}, \ldots, y_{n}\big]^\mathsf{T}\), \(\mathbf{x}_{t,j} = [x_{1,t,j}, x_{2,t,j}, \ldots, x_{n,t,j}]^\mathsf{T}\) for \(t = 1, \ldots, L\) (our convention is that time \(t = L\) corresponds to the most recent time point, and time \(t = 1\) to the most distant) and \(j = 1, \ldots, p\), and \(\boldsymbol{\epsilon}\) are length-\(n\) vectors of the response observations, the predictor observations, and the independent and identically distributed noise with a constant variance, respectively.
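To build intuition for why grouping helps, here is a toy sketch in base R (an illustration of the idea, not the package's code): if several coefficients share a common value, the corresponding columns enter the model only through their sum, i.e., through a single derived predictor with one coefficient.

```r
# Toy illustration (not LongCTR itself): coefficients tied to a common
# value, e.g. beta_{5,1} = beta_{6,1} = beta_{7,1} = 3, contribute to the
# model only through the sum of their columns -- one derived predictor.
set.seed(42)
x5 <- rnorm(8); x6 <- rnorm(8); x7 <- rnorm(8)  # three predictor columns
full    <- 3 * x5 + 3 * x6 + 3 * x7             # three separate coefficients
derived <- 3 * (x5 + x6 + x7)                   # one coefficient on the sum
all.equal(full, derived)                        # the two are identical
```

Replacing \(L \times p\) individual coefficients by a few group-level coefficients is what makes the fitted model parsimonious and interpretable.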
library(LongCTR)
LongCTR sequentially adds derived predictors (the sums of the predictors in groups) to the model. This process continues until we obtain a final set of derived predictors satisfying the termination criterion. Users may pick a value Kmax up front for the total number of levels in the fitted LCTR tree at termination. Alternatively, the best model size can be chosen using K-fold cross-validation (CV); users may change the cv option depending on their preferences. If CV is used, a user must select Kmax as the upper bound on the best model size. The default is Kmax = 20; however, in some cases a better model might be obtained with Kmax > 20. In that case, the warm-start feature allows users to pass an existing LongCTR object so that fitting continues from it rather than restarting from scratch.
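As an illustration of the warm-start workflow (a sketch, not run here: it assumes the LongCTR package is loaded and that X, L, and y are the simulated data generated later in this vignette; the object names fit20 and fit30 are ours):

```r
# Sketch of the warm-start workflow (assumes library(LongCTR) and the
# simulated X, L, y from this vignette; fit20/fit30 are illustrative names).
fit20 <- LongCTR(X, L, y, Kmax = 20, cv = 0, warm.start = NULL)
# Continue the same fit up to 30 levels instead of refitting from scratch:
fit30 <- LongCTR(X, L, y, Kmax = 30, cv = 0, warm.start = fit20)
```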
LongCTR(X, L, y, Kmax=5, cv=0, warm.start=NULL)
| Input | Definition |
|---|---|
| L | The number of time points. |
| X | The \(n \times (p \times L)\) design matrix of predictor observations to which the LCTR is fit. |
| y | The \(n\)-dimensional vector of the response observations. |
| Kmax | A numeric scalar that specifies the upper bound on the best model size. If cv = 0, it specifies the model size at termination. |
| cv | A numeric scalar that specifies the number of folds used in \(K\)-fold CV. If cv = 0, the LCTR model is fit with Kmax derived predictors at termination, without CV. |
| warm.start | If NULL, the LCTR model is fit from scratch. Otherwise, an object of class longCTR for the use of the warm-start strategy. |
We generate a dataset with \(p = 3\), \(L = 10\) and \(n = 1,500\), where \(\mathbf{X}\) is the \(n \times (p \times L)\) design matrix of predictor observations and \(\mathbf{y}\) is the \(n\) dimensional vector of the response observations. We use this dataset throughout the vignette:
# Generate some data
set.seed(1234)
p <- 3
L <- 10
n <- 1500
X <- scale(matrix(rnorm(n * p * L), nrow = n))
beta <- c(0, 0, 0, 0, 3, 3, 3, 3.5, 3.5, 3.5,
          0, 0, 0, 0, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8,
          1, 1, 1, 1, 3, 3, 3, 3.2, 3.2, 3.2)
sigmasq <- ((1/0.9) - 1) * (t(beta) %*% cov(X) %*% beta)
y <- X %*% beta + rnorm(n, mean = 0, sd = sqrt(sigmasq))
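The factor \((1/0.9 - 1)\) is chosen so that the theoretical \(R^2\) of the simulated model is 0.9, regardless of the signal variance. A quick arithmetic check (the value of s below is an arbitrary stand-in for the signal variance \(\boldsymbol{\beta}^\mathsf{T}\mathrm{cov}(\mathbf{X})\boldsymbol{\beta}\)):

```r
# With noise variance sigmasq = ((1/0.9) - 1) * s, where s is the signal
# variance, the population R^2 is s / (s + sigmasq) = 0.9 for any s > 0.
s <- 123.4                      # arbitrary stand-in signal variance
sigmasq <- ((1/0.9) - 1) * s    # same formula as in the chunk above
r2 <- s / (s + sigmasq)         # population R^2 implied by this choice
r2                              # approximately 0.9
```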
We first illustrate the most basic call to LongCTR, without using CV. As mentioned above, LongCTR requires a termination criterion that specifies the number of levels in the tree at termination. The best model size can be obtained via \(K\)-fold CV; however, in some situations users may prefer a prespecified number of levels instead of searching for the best model size via CV. In this example, we set cv = 0 and Kmax = 5, assuming that an interpretable and accurate model can be obtained with 5 levels in the tree structure at termination.
# Fit LongCTR model
LCTR.fit <- LongCTR(X, L, y, Kmax=5, cv=0, warm.start=NULL)
#Summarize the fitted LongCTR model
summary.grpstr(LCTR.fit$groupstrlist)
#> Coefficient Tree Regression
#>
#> Final group structure:
#>
#> (Group)(alpha)(Pred Ids)(Time Ids)
#> (1)(3.5178){1}{8-10}(TRUE)
#> (2)(0.7935){2}{5-10}(TRUE)
#> (3)(1.0542){3}{1-4}(TRUE)
#> (4)(0.3256){2}{4-4}(TRUE)
#> (5)(3.1164){3}{8-10}(TRUE)
#> (6)(2.946){1,3}{5-7}(TRUE)
#> (7)(0){1}{4-4}(FALSE)
#> (8)(0){1,2}{1-3}(FALSE)
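To gauge how well the discovered structure matches the truth, we can expand the printed groups back into a full 30-dimensional coefficient vector and compare it with the beta used to simulate the data (a manual check using the alpha values from the summary above; groups marked FALSE have coefficient zero):

```r
# Expand the printed group structure into the implied coefficient vector
# (ordering: predictor 1 at times 1-10, then predictor 2, then predictor 3).
beta.hat <- numeric(30)
beta.hat[8:10]  <- 3.5178          # group (1): pred {1},   times 8-10
beta.hat[15:20] <- 0.7935          # group (2): pred {2},   times 5-10
beta.hat[21:24] <- 1.0542          # group (3): pred {3},   times 1-4
beta.hat[14]    <- 0.3256          # group (4): pred {2},   time  4
beta.hat[28:30] <- 3.1164          # group (5): pred {3},   times 8-10
beta.hat[c(5:7, 25:27)] <- 2.946   # group (6): preds {1,3}, times 5-7
# Groups (7) and (8) have coefficient 0, so their entries stay zero.
beta.true <- c(0, 0, 0, 0, 3, 3, 3, 3.5, 3.5, 3.5,
               0, 0, 0, 0, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8,
               1, 1, 1, 1, 3, 3, 3, 3.2, 3.2, 3.2)
max(abs(beta.hat - beta.true))     # largest absolute estimation error
```

The only notable discrepancy is group (4): predictor 2 at time 4, which is truly zero, is estimated at about 0.33; all other group coefficients are within about 0.09 of the truth.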
As mentioned above, LCTR results in a highly interpretable tree structure representing the sequence of group structures produced after each iteration. The hierarchical nature of the LCTR tree provides insight into the importance of the predictors, their higher-level grouping structures, and how they relate to each other. print displays a visual representation of the fitted LCTR object as follows:
print.grpstr(LCTR.fit$groupstrlist, L=10, K=5)
#> (Group)(alpha){PredIds}{TimeIds}
#> (1.1)(NA){1,2,3}{1-10}
#> (2.1)(3.1251){1,3}{5-10}
#> (3.1)(3.128){1,3}{5-10}
#> (4.1)(3.1345){1,3}{5-10}
#> (5.1)(3.5256){1}{8-10}
#> (5.5)(3.1151){3}{8-10}
#> (5.6)(2.9462){1,3}{5-7}
#> (2.2)(0){2}{5-10}
#> (3.2)(0){2}{5-10}
#> (4.2)(0.8079){2}{5-10}
#> (5.2)(0.7979){2}{5-10}
#> (2.3)(0){1,2,3}{1-4}
#> (3.3)(1.0251){3}{1-4}
#> (4.3)(1.0449){3}{1-4}
#> (5.3)(1.0453){3}{1-4}
#> (3.4)(0){1,2}{1-4}
#> (4.4)(0){1,2}{1-4}
#> (5.4)(0){1,2}{1-4}
print.tree(LCTR.fit$groupstrlist, K=5)
Predictions can be made from the fitted LongCTR object; the newdata argument supplies the new input matrix.
yhat <- predict.groupstr(LCTR.fit$groupstr, newdata=X[1:5,])
yhat
#> [,1]
#> [1,] 1.646212
#> [2,] -8.393045
#> [3,] -10.704253
#> [4,] -14.081439
#> [5,] -6.954895
LongCTR requires a termination criterion to determine the number of derived predictors in the model at termination. Note that the model size is the only tuning parameter for LCTR, and it can be chosen using \(K\)-fold CV. If cv > 1, the package performs \(K\)-fold CV, with \(K\) = cv, to find the best model size, and then fits the final LCTR model of this chosen size using all the data. When using CV, a user must select an upper bound Kmax on the best model size.
LongCTR(X, L, y, Kmax = 20, cv = 10, warm.start = NULL)
We continue with the sample data generated above. In this example, we set cv = 10 to perform 10-fold CV and use Kmax = 20 as an upper bound on the total number of derived predictors at termination.
cvLCTR.fit <- LongCTR(X, L, y, Kmax=20, cv=10, warm.start=NULL)
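The selection logic behind the cv argument can be sketched generically as follows (a toy illustration of \(K\)-fold CV over candidate model sizes, not the package's internal code; the stand-in error curve below replaces an actual LongCTR fit):

```r
# Generic K-fold CV over candidate model sizes (illustration only):
# split the rows into cv folds, score every size on each held-out fold,
# and keep the size with the smallest average held-out error.
set.seed(1)
n <- 120; cv <- 10; Kmax <- 20
folds <- sample(rep(seq_len(cv), length.out = n))   # fold label per row
cv.err <- matrix(NA_real_, nrow = cv, ncol = Kmax)  # fold-by-size errors
for (f in seq_len(cv)) {
  held.out <- which(folds == f)                     # rows scored this round
  for (K in seq_len(Kmax)) {
    # In the real procedure one would fit a size-K model on the other
    # folds and compute the held-out mean squared error here; we use a
    # stand-in error curve with its minimum near K = 7 instead:
    cv.err[f, K] <- (K - 7)^2 + rnorm(1, sd = 0.1)
  }
}
K.best <- which.min(colMeans(cv.err))               # chosen model size
```

The package performs this search internally and then refits the final model of size K.best on all the data.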