| Title: | Entrywise Splitting Cross-Validation for Factor Models |
|---|---|
| Description: | Implements entrywise splitting cross-validation (ECV) and its penalized variant (pECV) for selecting the number of factors in generalized factor models. |
| Authors: | Zhijing Wang [aut, cre] |
| Maintainer: | Zhijing Wang <[email protected]> |
| License: | GPL-3 |
| Version: | 1.0.1 |
| Built: | 2026-05-26 07:59:29 UTC |
| Source: | https://github.com/cran/pECV |
Data-driven estimation of the constraint constant C used in the alternating maximization algorithm, for continuous data, based on a truncated SVD. The function decomposes the data matrix and estimates C from the maximum row norms of the resulting factor matrices.
estimate_C(X, qmax = 8, safety = 1.2)
| X | n x p continuous data matrix |
|---|---|
| qmax | Rank for truncated SVD (default 8) |
| safety | Safety parameter for conservative estimation (default 1.2) |
The function performs the following steps:

1. Computes the truncated SVD of X with rank qmax.
2. Constructs the factor matrices A = U * sqrt(D) and B = V * sqrt(D).
3. Calculates the row 2-norms of A and B.
4. Takes the maximum norm and multiplies it by the safety parameter.
For count data, it is recommended to transform the data using log(X + 1) before applying this function.
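The steps listed above can be sketched in base R as follows. This is an illustrative re-implementation under stated assumptions, not the package source: the name `sketch_estimate_C` is hypothetical, and capping the rank at `min(n, p)` is an assumption about how an oversized `qmax` might be handled.

```r
# Illustrative sketch only; not the code behind estimate_C().
sketch_estimate_C <- function(X, qmax = 8, safety = 1.2) {
  k <- min(qmax, nrow(X), ncol(X))           # assumed cap on the rank
  s <- svd(X, nu = k, nv = k)                # leading k singular vectors
  d_sqrt <- sqrt(s$d[seq_len(k)])            # square roots of singular values
  A <- sweep(s$u, 2, d_sqrt, `*`)            # A = U * sqrt(D), n x k
  B <- sweep(s$v, 2, d_sqrt, `*`)            # B = V * sqrt(D), p x k
  a_norms <- sqrt(rowSums(A^2))              # row 2-norms of A
  b_norms <- sqrt(rowSums(B^2))              # row 2-norms of B
  C_norm_hat <- max(c(a_norms, b_norms))     # largest row norm overall
  list(
    qmax = k, safety = safety,
    C_norm_hat = C_norm_hat,
    C_est = safety * C_norm_hat,             # conservative estimate
    a_norms = a_norms, b_norms = b_norms
  )
}

set.seed(1)
X_demo <- matrix(rnorm(20 * 10), 20, 10)
res_demo <- sketch_estimate_C(X_demo, qmax = 3)
```

Because A and B share the factor sqrt(D), their product A %*% t(B) is the rank-k approximation of X, so the row norms bound the scale of both factor matrices at once.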
A list containing:
| qmax | Truncation rank used |
|---|---|
| safety | Safety parameter applied |
| C_norm_hat | Original maximum row norm |
| C_est | Final conservative estimate of C |
| a_norms | Row norms of factor matrix A |
| b_norms | Row norms of factor matrix B |

    # Example 1: Continuous data
    set.seed(123)
    n <- 100; p <- 50; q <- 3
    theta_true <- matrix(runif(n * q), n, q)
    A_true <- matrix(runif(p * q), p, q)
    X <- theta_true %*% t(A_true) + matrix(rnorm(n * p, sd = 0.5), n, p)

    # Estimate C
    C_result <- estimate_C(X, qmax = 5)
    print(C_result$C_est)

    # Example 2: Count data (apply log transformation)
    lambda <- exp(theta_true %*% t(A_true))
    X_count <- matrix(rpois(n * p, lambda = as.vector(lambda)), n, p)
    X_transformed <- log(X_count + 1)
    C_count <- estimate_C(X_transformed, qmax = 5)
    print(C_count$C_est)

Data-driven estimation of the constraint constant C for binary data using cross-window smoothing and empirical logit transformation.
estimate_C_binary(X, qmax = 8, safety = 1.5, eps = 1e-12, radius = 1)
| X | n x p binary data matrix (0/1 values) |
|---|---|
| qmax | Rank for truncated SVD (default 8) |
| safety | Safety parameter for conservative estimation (default 1.5) |
| eps | Small constant that keeps the logit finite when an estimated probability is 0 or 1 (default 1e-12) |
| radius | Radius for cross-window smoothing (default 1) |
The function performs the following steps:

1. Applies cross-window smoothing to estimate probabilities.
2. Performs an empirical logit transformation with smoothing.
3. Computes the truncated SVD of the transformed matrix.
4. Constructs matrices A and B and calculates their row norms.
5. Estimates C as the maximum norm times the safety parameter.
The cross-window smoothing helps stabilize probability estimates, especially for sparse binary data.
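The first two steps can be illustrated with a small base R sketch. The exact window shape and logit formula used by `estimate_C_binary` are not spelled out here, so the code below rests on assumptions: the window is taken to be a cross (the entry's neighbours within `radius` along its row and its column), and the logit is computed after clamping probabilities to `[eps, 1 - eps]`. The helper names `sketch_cross_smooth` and `sketch_logit` are hypothetical.

```r
# Illustrative sketch only; window shape and logit formula are assumptions.
sketch_cross_smooth <- function(X, radius = 1) {
  n <- nrow(X); p <- ncol(X)
  P <- matrix(0, n, p)
  N <- matrix(0L, n, p)
  for (i in seq_len(n)) {
    rows <- max(1, i - radius):min(n, i + radius)
    for (j in seq_len(p)) {
      cols <- max(1, j - radius):min(p, j + radius)
      # Cross-shaped window: neighbours in the same column plus neighbours
      # in the same row, with the centre entry counted once.
      vals <- c(X[rows, j], X[i, setdiff(cols, j)])
      P[i, j] <- mean(vals)      # smoothed probability estimate
      N[i, j] <- length(vals)    # number of entries in the window
    }
  }
  list(P_smooth = P, N_counts = N)
}

sketch_logit <- function(P, eps = 1e-12) {
  P <- pmin(pmax(P, eps), 1 - eps)  # clamp away from 0 and 1
  log(P / (1 - P))
}

set.seed(1)
X_bin <- matrix(rbinom(30 * 12, 1, 0.3), 30, 12)
sm <- sketch_cross_smooth(X_bin, radius = 1)
Mhat_demo <- sketch_logit(sm$P_smooth)
```

Under this reading, windows shrink at the matrix border, which is why a per-entry count such as `N_counts` is worth keeping alongside the smoothed probabilities.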
A list containing:
| radius | Cross-window radius used |
|---|---|
| qmax | Truncation rank used |
| safety | Safety parameter applied |
| C0 | Original maximum row norm |
| C_est | Final conservative estimate of C |
| a_norms | Row norms of factor matrix A |
| b_norms | Row norms of factor matrix B |
| Mhat | Logit-transformed matrix |
| P_smooth | Smoothed probability matrix |
| N_counts | Count of values in each smoothing window |
Generate simulated data from a binary (logistic) factor model.
generate_binary_data(n = 100, p = 50, q = 3)
| n | Integer. Number of observations. |
|---|---|
| p | Integer. Number of variables. |
| q | Integer. True number of latent factors. |
A named list with components:
- Binary matrix (n x p). Generated 0/1 responses.
- Integer. True number of factors used in simulation.
- Numeric matrix (n x q). True latent factor scores.
- Numeric matrix (p x q). True factor loadings.
- Numeric vector (length p). Item intercepts.
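For readers who want to see what a logistic factor model simulation looks like, here is a minimal base R sketch. It is not the code behind `generate_binary_data`: the function name `sketch_binary_data`, the list component names, and the sampling distributions for scores, loadings, and intercepts are all assumptions chosen for illustration.

```r
# Hypothetical simulator; the distributions below are assumptions.
sketch_binary_data <- function(n = 100, p = 50, q = 3) {
  theta <- matrix(rnorm(n * q), n, q)       # latent factor scores, n x q
  A <- matrix(runif(p * q, -1, 1), p, q)    # factor loadings, p x q
  d <- rnorm(p)                             # item intercepts, length p
  eta <- sweep(theta %*% t(A), 2, d, `+`)   # linear predictor, n x p
  prob <- plogis(eta)                       # logistic link
  resp <- matrix(rbinom(n * p, 1, as.vector(prob)), n, p)
  list(resp = resp, theta = theta, A = A, d = d)
}

set.seed(42)
sim_bin <- sketch_binary_data(n = 40, p = 15, q = 2)
```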
Generate simulated data from a binary (logistic) factor model with missing values.
generate_binary_data_miss(n = 100, p = 50, q = 3, miss_prop = 0.05)
| n | Integer. Number of observations. |
|---|---|
| p | Integer. Number of variables. |
| q | Integer. True number of latent factors. |
| miss_prop | Numeric in (0,1). Proportion of missing values (default 0.05). |
A named list with components:
- Binary matrix (n x p). Generated 0/1 responses with missing values (NA).
- Binary matrix (n x p). Complete data before missingness.
- Integer. True number of factors used in simulation.
- Numeric matrix. True latent factor scores.
- Numeric matrix. True factor loadings.
- Numeric vector (length p). Item intercepts.
- Numeric. Proportion of entries set to missing.
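Turning a complete matrix into one with missing entries can be sketched as below. The missingness mechanism used by the `*_miss` generators is not stated here, so this sketch assumes entries are removed completely at random; `sketch_mask_entries` and its component names are hypothetical.

```r
# Hypothetical helper; assumes missingness is completely at random.
sketch_mask_entries <- function(X, miss_prop = 0.05) {
  stopifnot(miss_prop > 0, miss_prop < 1)
  n_miss <- round(miss_prop * length(X))   # number of cells to blank out
  idx <- sample.int(length(X), n_miss)     # cell positions, without replacement
  X_miss <- X
  X_miss[idx] <- NA
  list(
    resp = X_miss,                         # data with NA entries
    resp_complete = X,                     # data before masking
    miss_prop = mean(is.na(X_miss))        # realised missing proportion
  )
}

set.seed(7)
X_full <- matrix(rbinom(20 * 10, 1, 0.5), 20, 10)
masked <- sketch_mask_entries(X_full, miss_prop = 0.05)
```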
Generate simulated data from a Gaussian factor model.
generate_continuous_data(n = 100, p = 50, q = 3, noise_sd = 1)
| n | Integer. Number of observations. |
|---|---|
| p | Integer. Number of variables. |
| q | Integer. True number of latent factors. |
| noise_sd | Numeric. Standard deviation of Gaussian noise. |
A named list with components:
- Numeric matrix (n x p). Generated observed data.
- Integer. True number of factors used in simulation.
- Numeric matrix (n x (q+1)). True latent factor scores with intercept.
- Numeric matrix (p x (q+1)). True factor loadings.
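The (q+1)-column shapes above come from folding the intercept into the factor matrix as a constant column of ones. A minimal sketch of a Gaussian factor model built that way follows; the function name `sketch_continuous_data`, the component names, and the sampling distributions are assumptions, not the package's implementation.

```r
# Hypothetical simulator; sampling distributions are assumptions.
sketch_continuous_data <- function(n = 100, p = 50, q = 3, noise_sd = 1) {
  # First column of ones carries the intercept, giving n x (q + 1).
  theta <- cbind(1, matrix(rnorm(n * q), n, q))
  A <- matrix(runif(p * (q + 1), -1, 1), p, q + 1)   # p x (q + 1)
  signal <- theta %*% t(A)                           # low-rank mean, n x p
  resp <- signal + matrix(rnorm(n * p, sd = noise_sd), n, p)
  list(resp = resp, theta = theta, A = A)
}

set.seed(3)
sim_cont <- sketch_continuous_data(n = 30, p = 12, q = 2, noise_sd = 0.5)
```

With this layout the first column of the loading matrix plays the role of the per-variable intercept.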
Generate simulated data from a Gaussian factor model with missing values.
generate_continuous_data_miss(n = 100, p = 50, q = 3, noise_sd = 1, miss_prop = 0.05)
| n | Integer. Number of observations. |
|---|---|
| p | Integer. Number of variables. |
| q | Integer. True number of latent factors. |
| noise_sd | Numeric. Standard deviation of Gaussian noise. |
| miss_prop | Numeric in (0,1). Proportion of missing values (default 0.05). |
A named list with components:
- Numeric matrix (n x p). Generated data with missing values (NA).
- Numeric matrix (n x p). Complete data before missingness.
- Integer. True number of factors used in simulation.
- Numeric matrix (n x (q+1)). True latent factor scores with intercept.
- Numeric matrix (p x (q+1)). True factor loadings.
- Numeric. Proportion of entries set to missing.
Generate simulated data from a Poisson factor model.
generate_count_data(n = 100, p = 50, q = 3)
| n | Integer. Number of observations. |
|---|---|
| p | Integer. Number of variables. |
| q | Integer. True number of latent factors. |
A named list with components:
- Integer matrix (n x p). Generated Poisson observations.
- Integer. True number of factors used in simulation.
- Numeric matrix (n x (q+1)). True latent factor scores with intercept.
- Numeric matrix (p x (q+1)). True factor loadings.
Generate simulated data from a Poisson factor model with missing values.
generate_count_data_miss(n = 100, p = 50, q = 3, miss_prop = 0.05)
| n | Integer. Number of observations. |
|---|---|
| p | Integer. Number of variables. |
| q | Integer. True number of latent factors. |
| miss_prop | Numeric in (0,1). Proportion of missing values (default 0.05). |
A named list with components:
- Integer matrix (n x p). Generated data with missing values (NA).
- Integer matrix (n x p). Complete data before missingness.
- Integer. True number of factors used in simulation.
- Numeric matrix (n x (q+1)). True latent factor scores with intercept.
- Numeric matrix (p x (q+1)). True factor loadings.
- Numeric. Proportion of entries set to missing.
Uses (Penalized) Entrywise Splitting Cross-Validation (ECV / pECV) to estimate the number of latent factors in generalized factor models.
pECV(resp, C = 5, qmax = 8, fold = 5, tol_val = 0.01, theta0 = NULL, A0 = NULL, seed = 1, data_type = NULL)
| resp | Observation data matrix (n x p); can be continuous, count, or binary. |
|---|---|
| C | Constraint constant, default is 5. |
| qmax | Maximum number of factors to consider, default is 8. |
| fold | Number of folds in cross-validation, default is 5. |
| tol_val | Convergence tolerance, default is 0.01 (interpreted as 0.01 / number of estimated elements). |
| theta0 | Optional initial matrix for factors; sampled from Uniform if not provided. |
| A0 | Optional initial matrix for loadings; sampled from Uniform if not provided. |
| seed | Random seed, default is 1. |
| data_type | Data type, one of "continuous", "count", "binary". If not specified, it is auto-detected. |
The example below may take more than 5 seconds on some machines and is therefore not run during routine checks.
A named list with components:
- Integer. Number of factors selected by standard ECV.
- Integer. Number of factors selected by ECV with penalty 1.
- Integer. Number of factors selected by ECV with penalty 2.
- Integer. Number of factors selected by ECV with penalty 3.
- Integer. Number of factors selected by ECV with penalty 4.
- Numeric vector. Cross-validation loss for each candidate factor number (typically of length qmax).
- Character. The detected/used data type: "continuous", "count", or "binary".
The return value has base R types (no special S3/S4 class).
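Entrywise splitting assigns individual matrix entries, rather than whole rows or columns, to cross-validation folds. The sketch below illustrates only that assignment step, assuming a balanced, uniformly random split; how `pECV` actually splits entries, fits each candidate model, and applies its penalties is not described in this section. `sketch_entrywise_folds` is a hypothetical helper.

```r
# Illustrative sketch of entrywise fold assignment; not the pECV internals.
sketch_entrywise_folds <- function(n, p, fold = 5, seed = 1) {
  set.seed(seed)
  labels <- rep_len(seq_len(fold), n * p)  # balanced fold labels
  matrix(sample(labels), n, p)             # shuffled into an n x p layout
}

folds <- sketch_entrywise_folds(n = 6, p = 4, fold = 5)

resp_demo <- matrix(rpois(6 * 4, lambda = 3), 6, 4)
train <- resp_demo
train[folds == 1] <- NA                    # hide fold 1 from the fit
held_out <- resp_demo[folds == 1]          # entries used to score the fit
```

Each fold leaves every row and column partially observed, so a factor model can still be fitted to `train` and scored on `held_out`.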

    set.seed(123)

    # Generate count data
    n <- 50; p <- 50; q <- 2
    theta_true <- cbind(1, matrix(runif(n * q, -2, 2), n, q))
    A_true <- matrix(runif(p * (q + 1), -2, 2), p, (q + 1))
    lambda <- exp(theta_true %*% t(A_true))
    resp <- matrix(
      rpois(length(lambda), lambda = as.vector(lambda)),
      nrow = nrow(lambda), ncol = ncol(lambda)
    )

    result <- pECV(resp, C = 4, qmax = 4, fold = 5)
    print(result)

Uses (Penalized) Entrywise Splitting Cross-Validation to estimate the number of latent factors in generalized factor models when the data contain missing values.
pECV.miss(resp, C = 5, qmax = 8, fold = 5, tol_val = 0.01, theta0 = NULL, A0 = NULL, seed = 1, data_type = NULL)
| resp | Observation data matrix (n x p) with missing values as NA. |
|---|---|
| C | Constraint constant, default is 5. |
| qmax | Maximum number of factors to consider, default is 8. |
| fold | Number of folds in cross-validation, default is 5. |
| tol_val | Convergence tolerance, default is 0.01 (interpreted as 0.01 / number of estimated elements). |
| theta0 | Optional initial matrix for factors; sampled from Uniform if not provided. |
| A0 | Optional initial matrix for loadings; sampled from Uniform if not provided. |
| seed | Random seed, default is 1. |
| data_type | Data type, one of "continuous", "count", "binary". If not specified, it is auto-detected. |
The example below may take more than 5 seconds on some machines and is therefore not run during routine checks.
A named list with components:
- Integer. Number of factors selected by standard ECV.
- Integer. Number of factors selected by ECV with penalty 1.
- Integer. Number of factors selected by ECV with penalty 2.
- Integer. Number of factors selected by ECV with penalty 3.
- Integer. Number of factors selected by ECV with penalty 4.
- Numeric vector. Cross-validation loss for each candidate factor number (typically of length qmax).
- Character. The detected/used data type: "continuous", "count", or "binary".
- Numeric scalar. Percentage of missing entries in resp.
The return value uses base R types (no special S3/S4 class).
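The share of missing entries can be checked by hand before calling the function. The snippet below is a small sketch; the wording "Percentage" suggests a 0 to 100 scale, but the scale actually reported by `pECV.miss` is an assumption here, so both forms are shown.

```r
# Toy matrix with 2 missing cells out of 8.
resp_na <- matrix(c(1, NA, 3, 4, NA, 6, 7, 8), nrow = 2)

miss_fraction <- mean(is.na(resp_na))   # share of missing cells, in [0, 1]
miss_percent <- 100 * miss_fraction     # the same quantity on a 0-100 scale
```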

    set.seed(123)

    # Generate count data with missing values
    n <- 50; p <- 50; q <- 2
    theta_true <- cbind(1, matrix(runif(n * q, -2, 2), n, q))
    A_true <- matrix(runif(p * (q + 1), -2, 2), p, (q + 1))
    lambda <- exp(theta_true %*% t(A_true))
    resp <- matrix(
      rpois(length(lambda), lambda = as.vector(lambda)),
      nrow = nrow(lambda), ncol = ncol(lambda)
    )

    # Introduce 5% missing values
    miss_idx <- sample(1:(n * p), size = 0.05 * n * p)
    resp[miss_idx] <- NA

    result <- pECV.miss(resp, C = 4, qmax = 4, fold = 5)
    print(result)