Title: Metabolomics Data Analysis Toolbox
Description: Functions for metabolomics data analysis: data preprocessing, orthogonal signal correction, PCA analysis, PCA-DA analysis, PLS-DA analysis, classification, feature selection, correlation analysis, data visualisation and re-sampling strategies.
Authors: Wanchang Lin
Maintainer: Wanchang Lin <[email protected]>
License: GPL (>= 2)
Version: 2.0-1.20
Built: 2024-11-08 04:27:38 UTC
Source: https://github.com/wanchanglin/mt
An FIE-MS data set.

data(abr1)

abr1 is a list of FIE-MS data matrices derived from analysis of samples representing a time course of pathogen attack in a model plant species (Brachypodium distachyon). The data were acquired in a single batch with all samples randomised, using a Thermo LTQ linear ion trap. Both positive and negative ion modes are given (abr1$pos and abr1$neg).
A list with the following elements:

fact: A data frame containing experimental meta-data.
pos: A data frame of positive-mode data with 120 observations and 2000 variables.
neg: A data frame of negative-mode data with 120 observations and 2000 variables.
# Load data set
data(abr1)

# Select data set
dat <- abr1$neg

# number of observations and variables
dim(dat)

# Transform data
dat.log   <- preproc(dat, method = "log")
dat.sqrt  <- preproc(dat, method = "sqrt")
dat.asinh <- preproc(dat, method = "asinh")

op <- par(mfrow = c(2, 2), pch = 16)
matplot(t(dat),       main = "Original", type = "l", col = "blue",  ylab = "Intensity")
matplot(t(dat.log),   main = "Log",      type = "l", col = "green", ylab = "Intensity")
matplot(t(dat.sqrt),  main = "Sqrt",     type = "l", col = "red",   ylab = "Intensity")
matplot(t(dat.asinh), main = "ArcSinh",  type = "l", col = "black", ylab = "Intensity")
par(op)
mtext("Data set", line = 2.5, font = 3, cex = 1.5)
Estimate classification accuracy rate by resampling method.
accest(dat, ...)

## Default S3 method:
accest(dat, cl, method, pred.func = predict, pars = valipars(),
       tr.idx = NULL, ...)

## S3 method for class 'formula'
accest(formula, data = NULL, ..., subset, na.action = na.omit)

aam.cl(x, y, method, pars = valipars(), ...)

aam.mcl(x, y, method, pars = valipars(), ...)
formula: A formula of the form groups ~ x1 + x2 + ... That is, the response is the grouping factor and the right hand side specifies the (non-factor) discriminators.
data: Data frame from which variables specified in formula are preferentially to be taken.
dat, x: A matrix or data frame containing the explanatory variables if no formula is given as the principal argument.
cl, y: A factor specifying the class for each observation if no formula principal argument is given.
method: Classification method whose accuracy rate is to be estimated, such as randomForest, svm, knn and lda.
pred.func: Predict method (default is predict).
pars: A list of parameters used by the resampling method, such as Leave-one-out cross-validation, Cross-validation, Bootstrap and Randomised validation (holdout). See valipars for details.
tr.idx: User-defined index of training samples. Can be generated by trainind.
...: Additional parameters to method.
subset: Optional vector, specifying a subset of observations to be used.
na.action: Function which indicates what should happen when the data contains NAs.
The accuracy rates of classification are estimated by techniques such as Random Forest, Support Vector Machine, k-Nearest Neighbour Classification and Linear Discriminant Analysis based on resampling methods, including Leave-one-out cross-validation, Cross-validation, Bootstrap and Randomised validation (holdout).
accest returns an object including the components:
method: Classification method used.
acc: Overall accuracy rate.
acc.iter: Average accuracy rate for each iteration.
acc.all: Accuracy rate for each iteration and replication.
auc: Overall area under receiver operating curve (AUC).
auc.iter: Average AUC for each iteration.
auc.all: AUC for each iteration and replication.
mar: Overall prediction margin.
mar.iter: Average prediction margin for each iteration.
mar.all: Prediction margin for each iteration and replication.
err: Overall error rate.
err.iter: Average error rate for each iteration.
err.all: Error rate for each iteration and replication.
sampling: Sampling scheme used.
niter: Number of iterations.
nreps: Number of replications in each iteration if resampling is not loocv.
conf: Overall confusion matrix.
res.all: All results, which can be further processed.
acc.boot: A list of bootstrap accuracy, such as .632 and .632+, if the sampling method is bootstrap.

aam.cl returns a vector with acc (accuracy), auc (area under ROC curve) and mar (class margin).

aam.mcl returns a matrix with columns of acc (accuracy), auc (area under ROC curve) and mar (class margin).
accest can take any classification model if its argument format is model(formula, data, subset, na.action, ...) and its corresponding method predict.model(object, newdata, ...) either returns only the predicted class label or a list with a component called class, such as lda and pcalda.

If the classifier method provides posterior probabilities, the prediction margin mar will be generated, otherwise NULL.

If the classifier method provides posterior probabilities and the classification is a two-class problem, auc will be generated, otherwise NULL.

aam.cl is a wrapper function of accest, returning accuracy rate, AUC and classification margin. aam.mcl accepts multiple classifiers in one run.
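
To make the interface requirement above concrete, here is a minimal sketch of a hypothetical nearest-centroid classifier conforming to the model(formula, data, ...) / predict.model(object, newdata, ...) contract; the names centroid and predict.centroid are illustrative and not part of the package:

## A hypothetical classifier usable by accest: centroid() fits class
## centroids and predict.centroid() returns a list with a 'class' component.
centroid <- function(formula, data, ...) {
  mf  <- model.frame(formula, data)
  cl  <- model.response(mf)
  x   <- as.matrix(mf[, -1, drop = FALSE])
  cen <- apply(x, 2, function(col) tapply(col, cl, mean))  # class centroids
  structure(list(centres = cen, levels = levels(cl)), class = "centroid")
}

predict.centroid <- function(object, newdata, ...) {
  x <- as.matrix(newdata)
  ## squared Euclidean distance of each sample to each class centroid
  d <- apply(object$centres, 1, function(m) rowSums(sweep(x, 2, m)^2))
  list(class = factor(object$levels[max.col(-d)], levels = object$levels))
}

## e.g. accest(Species ~ ., data = iris, method = "centroid",
##             pars = valipars(sampling = "cv", niter = 2, nreps = 5))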
Wanchang Lin
binest, maccest, valipars, trainind, classifier
# Iris data
data(iris)

# Use KNN classifier and bootstrap for resampling
acc <- accest(Species ~ ., data = iris, method = "knn",
              pars = valipars(sampling = "boot", niter = 2, nreps = 5))
acc
summary(acc)
acc$acc.boot

# alternatively the traditional interface:
x <- subset(iris, select = -Species)
y <- iris$Species

## -----------------------------------------------------------------------
# Random Forest with 5-fold stratified cv
pars   <- valipars(sampling = "cv", niter = 4, nreps = 5, strat = TRUE)
tr.idx <- trainind(y, pars = pars)
acc1   <- accest(x, y, method = "randomForest", pars = pars, tr.idx = tr.idx)
acc1
summary(acc1)

# plot the accuracy in each iteration
plot(acc1)

## -----------------------------------------------------------------------
# Forensic Glass data in chap.12 of MASS
data(fgl, package = "MASS")    # in MASS package

# Randomised validation (holdout) of SVM for fgl data
acc2 <- accest(type ~ ., data = fgl, method = "svm", cost = 100, gamma = 1,
               pars = valipars(sampling = "rand", niter = 10,
                               nreps = 4, div = 2/3))
acc2
summary(acc2)

# plot the accuracy in each iteration
plot(acc2)

## -----------------------------------------------------------------------
## Examples of aam.cl and aam.mcl
aam.1 <- aam.cl(x, y, method = "svm", pars = pars)
aam.2 <- aam.mcl(x, y, method = c("svm", "randomForest"), pars = pars)

## If two classes are used, AUC will be calculated
idx   <- (y == "setosa")
aam.3 <- aam.cl(x[!idx, ], factor(y[!idx]), method = "svm", pars = pars)
aam.4 <- aam.mcl(x[!idx, ], factor(y[!idx]),
                 method = c("svm", "randomForest"), pars = pars)
Binary classification.
binest(dat, cl, choices = NULL, method, pars = valipars(), ...)
dat: A matrix or data frame containing the explanatory variables.
cl: A factor specifying the class for each observation.
choices: The vector or list of class labels to be chosen for binary classification. For details, see dat.sel.
method: Classification method to be used. For details, see accest.
pars: A list of parameters of the resampling method. For details, see valipars.
...: Additional parameters to method.
A list with components:

com: A matrix of combinations of the binary class labels.
acc: A table of classification accuracy for the binary combination in each iteration.
method: Classification method used.
sampling: Sampling scheme used.
niter: Number of iterations.
nreps: Number of replications in each iteration if resampling is not loocv.
Wanchang Lin
# iris data set
data(iris)
dat <- subset(iris, select = -Species)
cl  <- iris$Species

## PCALDA with cross-validation
pars      <- valipars(sampling = "cv", niter = 6, nreps = 5)
binpcalda <- binest(dat, cl, choices = c("setosa"), method = "pcalda",
                    pars = pars)

## SVM with leave-one-out cross-validation. SVM kernel is 'linear'.
pars   <- valipars(sampling = "loocv")
binsvm <- binest(dat, cl, choices = c("setosa", "virginica"), method = "svm",
                 pars = pars, kernel = "linear")

## randomForest with bootstrap
pars  <- valipars(sampling = "boot", niter = 5, nreps = 5)
binrf <- binest(dat, cl, choices = c("setosa", "virginica"),
                method = "randomForest", pars = pars)

## KNN with randomised validation. The number of neighbours is 3.
pars   <- valipars(sampling = "rand", niter = 5, nreps = 5)
binknn <- binest(dat, cl, choices = list(c("setosa", "virginica"),
                                         c("virginica", "versicolor")),
                 method = "knn", pars = pars, k = 3)
Calculate .632 bootstrap and .632 plus bootstrap error rate.
boot.err(err, resub)
err: Average error rate of bootstrap samples.
resub: A list including the apparent error rate, the class labels and the predicted class labels of the original training data (not re-sampled training data). Can be generated by classifier.
A list with the following components:

ae: Apparent error rate.
boot: Average error rate of bootstrap samples (same as err).
b632: .632 bootstrap error rate.
b632p: .632 plus bootstrap error rate.
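
The .632 estimator combines the apparent error rate ae with the average bootstrap error rate err; a minimal sketch of the formula from Efron and Tibshirani (1997), for illustration only (boot.err additionally computes the .632 plus variant, which adjusts the weight via the no-information error rate):

## .632 bootstrap error rate from 'ae' and 'err' (illustrative sketch)
b632 <- function(ae, err) 0.368 * ae + 0.632 * err
b632(ae = 0.02, err = 0.10)    # toy values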
Wanchang Lin
Witten, I. H. and Frank, E. (2005) Data Mining: Practical Machine Learning Tools and Techniques. Elsevier.
Efron, B. and Tibshirani, R. (1993) An Introduction to the Bootstrap. Chapman & Hall.
Efron, B. and Tibshirani, R. (1997) Improvements on cross-validation: the .632+ bootstrap method. Journal of the American Statistical Association, 92, 548-560.
## iris data set
data(iris)
x <- subset(iris, select = -Species)
y <- iris$Species

## 10 bootstrap training samples
pars   <- valipars(sampling = "boot", niter = 1, nreps = 10)
tr.idx <- trainind(y, pars = pars)[[1]]

## bootstrap error rate
err <- sapply(tr.idx, function(i) {
  classifier(x[i, , drop = FALSE], y[i],
             x[-i, , drop = FALSE], y[-i], method = "knn")$err
})

## average bootstrap error rate
err <- mean(err)

## apparent error rate
resub <- classifier(x, y, method = "knn")

err.boot <- boot.err(err, resub)
Boxplot method for error rate of each feature subset.
## S3 method for class 'frankvali'
boxplot(x, ...)
x: An object of class frankvali.
...: Additional arguments to the plot, such as main, xlab and ylab.
This function is a method for the generic function boxplot() for class frankvali. It plots the error rate of each feature subset.

Returns boxplot of class frankvali.
Wanchang Lin
data(abr1)
dat <- abr1$pos[, 110:500]
x   <- preproc(dat, method = "log10")
y   <- factor(abr1$fact$class)

dat <- dat.sel(x, y, choices = c("1", "2"))
x.1 <- dat[[1]]$dat
y.1 <- dat[[1]]$cls

pars <- valipars(sampling = "cv", niter = 2, nreps = 4)
res  <- frankvali(x.1, y.1, fs.method = "fs.rfe", fs.len = "power2",
                  cl.method = "knn", pars = pars)
res
summary(res)
boxplot(res)
Boxplot method for the accuracy rate of each classifier.
## S3 method for class 'maccest'
boxplot(x, ...)
x: An object of class maccest.
...: Additional arguments to the plot, such as main, xlab and ylab.
This function is a method for the generic function boxplot() for class maccest. It plots the accuracy rate for each classifier.

Returns boxplot of class maccest.
Wanchang Lin
# Iris data
data(iris)
x <- subset(iris, select = -Species)
y <- iris$Species

method <- c("randomForest", "svm", "knn")
pars   <- valipars(sampling = "cv", niter = 2, nreps = 5)
tr.idx <- trainind(y, pars = pars)
res    <- maccest(x, y, method = method, pars = pars,
                  comp = "anova", kernel = "linear")
res
boxplot(res)
Assess classification performances.
cl.rate(obs, pre)

cl.perf(obs, pre, pos = levels(as.factor(obs))[2])

cl.roc(stat, label, pos = levels(as.factor(label))[2], plot = TRUE, ...)

cl.auc(stat, label, pos = levels(as.factor(label))[2])
obs: Factor or vector of observed classes.
pre: Factor or vector of predicted classes.
stat: Factor or vector of statistics for positives/cases.
label: Factor or vector of labels for categorical data.
pos: Character string denoting the positive level.
plot: Logical flag indicating whether the ROC curve should be plotted.
...: Further arguments for plotting.
cl.perf gets classification performances such as the accuracy rate and the false positive rate. cl.roc computes the receiver operating characteristic (ROC). cl.auc calculates the area under the ROC curve. These three functions are only for binary-class problems.
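
For intuition, the AUC equals the Mann-Whitney statistic: the probability that a randomly chosen positive receives a higher statistic than a randomly chosen negative. A minimal sketch (the name auc.mw is illustrative, not part of the package):

## AUC via the Mann-Whitney statistic; ties count one half
auc.mw <- function(stat, label, pos = levels(as.factor(label))[2]) {
  xp <- stat[label == pos]    # statistics of positives
  xn <- stat[label != pos]    # statistics of negatives
  mean(outer(xp, xn, ">") + 0.5 * outer(xp, xn, "=="))
}
## e.g. auc.mw(rnorm(30), rbinom(30, size = 1, prob = 0.5))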
cl.rate returns a list with components:

acc: Accuracy rate of classification.
err: Error rate of classification.
con.mat: Confusion matrix.
kappa: Kappa statistic.
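
The kappa statistic corrects the observed agreement for agreement expected by chance; a minimal sketch from a confusion matrix (illustrative, not the package internals):

## Cohen's kappa from a confusion matrix 'cm' (observed in rows,
## predicted in columns)
kappa.stat <- function(cm) {
  po <- sum(diag(cm)) / sum(cm)                      # observed agreement
  pe <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2   # chance agreement
  (po - pe) / (1 - pe)
}
## e.g. kappa.stat(table(obs = c(1, 1, 2, 2), pre = c(1, 2, 2, 2)))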
cl.perf returns a list with components:

acc: Accuracy rate.
tpr: True positive rate.
fpr: False positive rate.
sens: Sensitivity.
spec: Specificity.
con.mat: Confusion matrix.
kappa: Kappa statistic.
positive: Positive level.
cl.roc returns a list with components:

perf: A data frame of performance measures at each cutoff point.
auc: Area under the ROC curve.
positive: Positive level.
cl.auc returns a scalar value of AUC.

AUC varies between 0.5 and 1.0 for sensible models; the higher the better. If it is less than 0.5, it should be corrected by 1 - AUC, or the calculation should be re-run using 1 - stat.
Wanchang Lin
Fawcett, T. (2006) An introduction to ROC analysis. Pattern Recognition Letters, 27, 861-874.
## Measurements of Forensic Glass Fragments
library(MASS)
data(fgl, package = "MASS")    # in MASS package
dat <- subset(fgl, grepl("WinF|WinNF", type))
## dat <- subset(fgl, type %in% c("WinF", "WinNF"))
x <- subset(dat, select = -type)
y <- factor(dat$type)

## construct train and test data
idx  <- sample(1:nrow(x), round((2/3) * nrow(x)), replace = FALSE)
tr.x <- x[idx, ]
tr.y <- y[idx]
te.x <- x[-idx, ]
te.y <- y[-idx]

model <- lda(tr.x, tr.y)

## predict the test data results
pred <- predict(model, te.x)

## classification performances
obs <- te.y
pre <- pred$class
cl.rate(obs, pre)
cl.perf(obs, pre, pos = "WinNF")
## change positive as "WinF"
cl.perf(obs, pre, pos = "WinF")

## ROC and AUC
pos  <- "WinNF"    ## or "WinF"
stat <- pred$posterior[, pos]
## levels(obs) <- c(0, 1)
cl.auc(stat, obs, pos = pos)
cl.roc(stat, obs, pos = pos)

## test examples for ROC and AUC
label <- rbinom(30, size = 1, prob = 0.2)
stat  <- rnorm(30)
cl.roc(stat, label, pos = levels(factor(label))[2], plot = TRUE)
cl.auc(stat, label, pos = levels(factor(label))[2])

## if auc is less than 0.5, it should be adjusted by 1 - auc.
## Or re-run them:
cl.roc(1 - stat, label, pos = levels(factor(label))[2], plot = TRUE)
cl.auc(1 - stat, label, pos = levels(factor(label))[2])
Wrapper function for classifiers. The classification model is built on the training data and error estimation is performed on the test data.
classifier(dat.tr, cl.tr, dat.te = NULL, cl.te = NULL, method,
           pred.func = predict, ...)
dat.tr: A data frame or matrix of training data. The classification model is built on it.
cl.tr: A factor or vector of training classes.
dat.te: A data frame or matrix of test data. Error rates are calculated on this data set.
cl.te: A factor or vector of test classes.
method: Classification method to be used. Any classification method can be employed if it has a corresponding predict method.
pred.func: Predict method (default is predict).
...: Additional parameters to method.
A list including components:

err: Error rate of test data.
cl: The original class of test data.
pred: The predicted class of test data.
posterior: Posterior probabilities for the classes if the classifier provides them.
acc: Accuracy rate of classification.
margin: The margin of predictions from the classifier. The margin of a data point is defined as the proportion of probability for the correct class minus the maximum proportion of probabilities for the other classes. A positive margin means correct classification, and vice versa. This idea comes from package randomForest; for more details, see margin in randomForest.
auc: The area under the receiver operating curve (AUC) if the classifier provides posterior probabilities and the problem is two-class.
The definition of margin is based on the posterior probabilities. Classifiers such as randomForest, svm, lda, qda, pcalda and plslda do output posterior probabilities, but knn does not.
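
As an illustration of the margin definition above, a minimal sketch for a hypothetical matrix of posterior probabilities prob (rows are samples, columns are class levels) and true classes cls; the name pred.margin is illustrative:

## margin = probability of the correct class minus the best of the others
pred.margin <- function(prob, cls) {
  idx  <- cbind(seq_len(nrow(prob)), match(as.character(cls), colnames(prob)))
  corr <- prob[idx]               # probability assigned to the correct class
  prob[idx] <- -Inf               # mask the correct class ...
  corr - apply(prob, 1, max)      # ... and subtract the best of the rest
}
## e.g. pred.margin(matrix(c(0.7, 0.2, 0.3, 0.8), nrow = 2,
##                         dimnames = list(NULL, c("a", "b"))),
##                  cls = c("a", "b"))    # gives 0.4 and 0.6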
Wanchang Lin
data(abr1)
dat <- preproc(abr1$pos[, 110:500], method = "log10")
cls <- factor(abr1$fact$class)

## tmp <- dat.sel(dat, cls, choices = c("1", "2"))
## dat <- tmp[[1]]$dat
## cls <- tmp[[1]]$cls

idx <- sample(1:nrow(dat), round((2/3) * nrow(dat)), replace = FALSE)

## construct train and test data
train.dat <- dat[idx, ]
train.cl  <- cls[idx]
test.dat  <- dat[-idx, ]
test.cl   <- cls[-idx]

## estimate accuracy
res <- classifier(train.dat, train.cl, test.dat, test.cl,
                  method = "randomForest")
res

## get confusion matrix
cl.rate(obs = res$cl, res$pred)    ## same as: cl.rate(obs = test.cl, res$pred)

## Measurements of Forensic Glass Fragments
data(fgl, package = "MASS")    # in MASS package
dat <- subset(fgl, grepl("WinF|WinNF", type))
## dat <- subset(fgl, type %in% c("WinF", "WinNF"))
x <- subset(dat, select = -type)
y <- factor(dat$type)

## construct train and test data
idx  <- sample(1:nrow(x), round((2/3) * nrow(x)), replace = FALSE)
tr.x <- x[idx, ]
tr.y <- y[idx]
te.x <- x[-idx, ]
te.y <- y[-idx]

res.1 <- classifier(tr.x, tr.y, te.x, te.y, method = "svm")
res.1
cl.rate(obs = res.1$cl, res.1$pred)

## classification performance for the two-class case.
pos <- "WinF"    ## select positive level
cl.perf(obs = res.1$cl, pre = res.1$pred, pos = pos)

## ROC and AUC
cl.roc(stat = res.1$posterior[, pos], label = res.1$cl, pos = pos)
Functions for correlation analysis of a data set.
cor.cut(mat, cutoff = 0.75, abs.f = FALSE,
        use = "pairwise.complete.obs", method = "pearson", ...)

cor.hcl(mat, cutoff = 0.75, use = "pairwise.complete.obs",
        method = "pearson", fig.f = TRUE, hang = -1,
        horiz = FALSE, main = "Cluster Dendrogram",
        ylab = ifelse(!horiz, "1 - correlation", ""),
        xlab = ifelse(horiz, "1 - correlation", ""), ...)

cor.heat(mat, use = "pairwise.complete.obs", method = "pearson",
         dend = c("right", "top", "none"), ...)

corrgram.circle(co,
                col.regions = colorRampPalette(c("red", "white", "blue")),
                scales = list(x = list(rot = 90)), ...)

corrgram.ellipse(co, label = FALSE,
                 col.regions = colorRampPalette(c("red", "white", "blue")),
                 scales = list(x = list(rot = 90)), ...)

cor.heat.gram(mat.1, mat.2, use = "pairwise.complete.obs",
              method = "pearson", main = "Heatmap of correlation",
              cex = 0.75, ...)

hm.cols(low = "green", high = "red", n = 123)
mat, mat.1, mat.2: A data frame or matrix. Note that mat.1 and mat.2, used by cor.heat.gram, must have the same number of rows.
cutoff: A scalar threshold value.
abs.f: Logical flag indicating whether absolute values should be used.
fig.f: Logical flag indicating whether the dendrogram of the correlation matrix should be plotted.
hang: The fraction of the plot height by which labels should hang below the rest of the plot. A negative value will cause the labels to hang down from 0. See plot.hclust.
horiz: Logical indicating whether the dendrogram should be drawn horizontally.
main, xlab, ylab: Graphical parameters; see plot.default.
dend: Character string indicating whether to draw a 'right', 'top' or 'none' dendrogram.
use: Argument for cor.
method: Argument for cor.
co: Correlation matrix.
label: A logical value indicating whether the correlation coefficients should be plotted.
...: Additional parameters to lattice.
col.regions: Colour vector to be used.
scales: A list determining how the x- and y-axes (tick marks and labels) are drawn. For details, see xyplot.
cex: A numeric multiplier to control character sizes for axis labels.
low: Colour for low values.
high: Colour for high values.
n: The number of colours (>= 1) to be in the palette.
cor.cut returns the pairs with correlation coefficients larger than cutoff.

cor.hcl computes hierarchical cluster analysis based on correlation coefficients. For other graphical parameters, see plot.dendrogram.

cor.heat displays a correlation heatmap using lattice.

corrgram.circle and corrgram.ellipse display corrgrams with circles and ellipses. The functions are modified from code given in Deepayan Sarkar's Lattice: Multivariate Data Visualization with R, 13.3.3 Corrgrams as customized level plots, pp. 238-241.

cor.heat.gram handles the correlation of two data sets which have the same number of rows. A typical application is the correlation between MS data (metabolites) and meta/clinical data.

hm.cols creates a vector of n contiguous colours for a heat map.
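
For intuition, hm.cols amounts to a colour ramp between the two end colours; a sketch of the idea (not necessarily the package's exact palette):

## contiguous colour ramp between 'low' and 'high' (illustrative sketch)
my.hm.cols <- function(low = "green", high = "red", n = 123)
  colorRampPalette(c(low, high))(n)
## e.g. image(matrix(1:123, ncol = 1), col = my.hm.cols())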
cor.cut returns a data frame with three columns, in which the first and second columns are variable names and their correlations (larger than cutoff) are given in the third column.

cor.hcl returns a list with components of each cluster group and all correlation coefficients.

cor.heat returns an object of class "trellis".

corrgram.circle returns an object of class "trellis".

corrgram.ellipse returns an object of class "trellis".

cor.heat.gram returns a list including the components:

cor.heat: An object of class "trellis" for the correlation heatmap ordered by hierarchical clustering.
cor.gram: An object of class "trellis" for the corrgram with circles ordered by hierarchical clustering.
cor.short: A matrix of correlation coefficients in short format.
cor.long: A matrix of correlation coefficients in long format.
Wanchang Lin
Michael Friendly (2002). Corrgrams: Exploratory displays for correlation matrices. The American Statistician, 56, 316–324.
D.J. Murdoch, E.D. Chow (1996). A graphical display of large correlation matrices. The American Statistician, 50, 178–180.
Deepayan Sarkar (2008). Lattice: Multivariate Data Visualization with R. Springer.
data(iris)
cor.cut(iris[, 1:4], cutoff = 0.8, use = "pairwise.complete.obs")
cor.hcl(iris[, 1:4], cutoff = 0.75, fig.f = TRUE)
ph <- cor.heat(iris[, 1:4], dend = "top")
ph
update(ph, scales = list(x = list(rot = 45)))

## change heatmap color scheme
cor.heat(iris[, 1:4], dend = "right", xlab = "", ylab = "",
         col.regions = colorRampPalette(c("green", "black", "red")))
## or use hm.cols
cor.heat(iris[, 1:4], dend = "right", xlab = "", ylab = "",
         col.regions = hm.cols())

## prepare data set
data(abr1)
cls <- factor(abr1$fact$class)
dat <- preproc(abr1$pos[, 110:1930], method = "log10")

## feature selection
res <- fs.rf(dat, cls)
## take top 20 features
fs <- res$fs.order[1:20]
## construct the data set for correlation analysis
mat <- dat[, fs]

cor.cut(mat, cutoff = 0.9)
ch <- cor.hcl(mat, cutoff = 0.75, fig.f = TRUE, xlab = "Peaks")
## plot dendrogram horizontally with coloured labels.
ch <- cor.hcl(mat, cutoff = 0.75, fig.f = TRUE, horiz = TRUE, center = TRUE,
              nodePar = list(lab.cex = 0.6, lab.col = "forest green", pch = NA),
              xlim = c(2, 0))
names(ch)

cor.heat(mat, dend = "right")
cor.heat(mat, dend = "right",
         col.regions = colorRampPalette(c("yellow", "red")))

## use corrgram with order by the hierarchical clustering
co  <- cor(mat, use = "pairwise.complete.obs")
ord <- order.dendrogram(as.dendrogram(hclust(as.dist(1 - co))))
corrgram.circle(co[ord, ord], main = "Corrgrams with circle")
corrgram.ellipse(co[ord, ord], label = TRUE, main = "Corrgrams with ellipse",
                 col.regions = hm.cols())
## if without ordering
corrgram.circle(co, main = "Corrgrams with circle")

## example of cor.heat.gram
fs.1    <- res$fs.order[21:50]
mat.1   <- dat[, fs.1]
res.cor <- cor.heat.gram(mat, mat.1,
                         main = "Heatmap of correlation between mat.1 and mat.2")
names(res.cor)
res.cor$cor.heat
res.cor$cor.gram
Generate index or data set of pairwise combination based on class labels.
combn.pw(cls, choices = NULL)

dat.sel(dat, cls, choices = NULL)
dat: A data frame or matrix of the data set.
cls: A factor or vector of class labels or categorical data.
choices: The vector or list of class labels to be chosen for binary combination.
If choices is NULL, all binary combinations will be computed. If choices has one class label, the comparisons between this one and any other classes will be calculated. If choices has more than two classes, all binary combinations in choices will be generated. For details, see the examples below.
combn.pw returns a data frame of indices (logical values).

dat.sel returns a list of lists with components:

dat: Pairwise data set.
cls: Pairwise class labels.
Wanchang Lin
Applications of dat.sel in pca_plot_wrap, lda_plot_wrap and pls_plot_wrap.
data(iris)
x <- subset(iris, select = -Species)
y <- iris$Species

## generate data set with class "setosa" and "virginica"
binmat.1 <- dat.sel(x, y, choices = c("setosa", "virginica"))
names(binmat.1)

## generate data sets for "setosa" vs other classes. These are:
## "setosa" and "versicolor", "setosa" and "virginica".
binmat.2 <- dat.sel(x, y, choices = c("setosa"))
names(binmat.2)

## generate data set with combination of each class. These are:
## "setosa" and "versicolor", "setosa" and "virginica",
## "versicolor" and "virginica"
binmat.3 <- dat.sel(x, y, choices = NULL)
names(binmat.3)

data(abr1)
cls <- factor(abr1$fact$class)
dat <- preproc(abr1$pos, method = "log")

## There are some examples of 'choices'
choices <- c("2")
choices <- c("2", "3", "4")
choices <- list(c("2", "3"), c("4", "5"))
choices <- NULL

idx    <- combn.pw(cls, choices = choices)
dat.pw <- dat.sel(dat, cls, choices = choices)
Grouped data visualisation by PCA, MDS, PCADA and PLSDA.
pca_plot_wrap(data.list, title = "plotting", ...)

mds_plot_wrap(data.list, method = "euclidean", title = "plotting", ...)

lda_plot_wrap(data.list, title = "plotting", ...)

lda_plot_wrap.1(data.list, title = "plotting", ...)

pls_plot_wrap(data.list, title = "plotting", ...)
data.list: A two-layer list structure, in which the second layer includes a data frame called dat and a factor of class labels called cls.
method: The distance measure to be used. This must be one of "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski". Any unambiguous substring can be given. It is only for mds_plot_wrap.
title: A part of the title string for plotting.
...: Further arguments for lattice plotting.
mds_plot_wrap returns a handle for the MDS plot.

The other four functions return a list with components: the first is an object of class "trellis" for data visualisation; the second is also an object of class "trellis" but plotting the corresponding variables, PCs (principal components), LDs (linear discriminants) and LCs (latent components); and the third is a matrix of these variables.

There is a slight difference between lda_plot_wrap.1 and lda_plot_wrap. The former plots two-class grouped data, which has one linear discriminant (LD1), with a strip plot. The latter plots two-class data by LD1 vs LD2, where LD2 is identical to LD1. Hence lda_plot_wrap is more general and can be applied to both two-class and multi-class data sets.
Wanchang Lin
pcaplot, mdsplot, plot.pcalda, plot.plsc, dat.sel, grpplot, panel.elli.1
data(iris)
x <- subset(iris, select = -Species)
y <- iris$Species

## generate data list by dat.sel
iris.pw <- dat.sel(x, y, choices = NULL)
names(iris.pw)

pca.p <- pca_plot_wrap(iris.pw, ep = 2)
pca.p[[1]]    ## visualised by PCA
pca.p[[2]]    ## plot PCA variables
pca.p[[3]]    ## matrix of PCA variables

mds.p <- mds_plot_wrap(iris.pw)
mds.p

pls.p <- pls_plot_wrap(iris.pw)
pls.p[[1]]
pls.p[[2]]
pls.p[[3]]

lda.p <- lda_plot_wrap.1(iris.pw)
lda.p[[1]]
lda.p[[2]]
lda.p[[3]]
lda_plot_wrap(iris.pw)$lda.p

## only plot iris data
ph <- pca_plot_wrap(list(list(dat = x, cls = y)))$pca.p    ## Not given data names
ph
update(ph, strip = FALSE)    ## strip is an argument of lattice

tmp <- list(iris.dat = list(dat = x, cls = y))
pca_plot_wrap(tmp)$pca.p
pca_plot_wrap(tmp, strip = FALSE)$pca.p
pls_plot_wrap(tmp, strip = FALSE)$pls.p
lda_plot_wrap(tmp, strip = FALSE)$lda.p

data(abr1)
cls <- factor(abr1$fact$class)
dat <- preproc(abr1$pos, method = "log")

## pair-wise data set
dat.pw <- dat.sel(dat, cls, choices = c("2", "3", "4"))

## add multi-class
idx     <- grep("2|3|4", cls)
cls.234 <- factor(cls[idx])
dat.234 <- dat[idx, , drop = FALSE]

## combine all
dat.tmp <- c(dat.pw,
             "2~3~4" = list(list(dat = dat.234, cls = cls.234)),
             all     = list(list(dat = dat, cls = cls)))

## PCA
ph <- pca_plot_wrap(dat.tmp, title = "abr1",
                    par.strip.text = list(cex = 0.75),
                    scales = list(cex = .75, relation = "free"), ep = 2)
## See function grpplot for usage of ep.
ph[[1]]
ph[[2]]

## PLSDA
ph <- pls_plot_wrap(dat.tmp, title = "abr1",
                    par.strip.text = list(cex = 0.75),
                    scales = list(cex = .75, relation = "free"), ep = 2)
ph[[1]]
ph[[2]]

## PCADA
ph <- lda_plot_wrap(dat.tmp, title = "abr1",
                    par.strip.text = list(cex = 0.75),
                    scales = list(cex = .75, relation = "free"))
ph[[1]]
ph[[2]]
Functions to summarise data.
df.summ(dat, method = vec.summ, ...)

vec.summ(x)

vec.summ.1(x)
dat: A data frame or matrix of the data set.
x: A numeric vector.
method: Summary method, such as vec.summ or vec.summ.1.
...: Additional parameters to method.
df.summ returns a summarised data frame.

vec.summ returns a vector of the number of values (excluding NAs), minimum, mean, median, maximum and standard deviation.

vec.summ.1 returns a vector of the number of values (excluding NAs), mean, median, 95% confidence interval of the median, IQR and standard deviation.
Wanchang Lin
data(abr1)
dat <- (abr1$pos)[, 110:150]
cls <- factor(abr1$fact$class)

## sort out missing value
dat <- mv.zene(dat)

## summary of an individual column
vec.summ(dat[, 2])
vec.summ.1(dat[, 2])

## summary of data frame
summ   <- df.summ(dat)    ## default: vec.summ
summ.1 <- df.summ(dat, method = vec.summ.1)

## summary by groups
by(dat, list(cls = cls), df.summ)

## User-defined summary function:
vec.segment <- function(x, bar = c("SD", "SE", "CI")) {
  bar    <- match.arg(bar)
  centre <- mean(x, na.rm = TRUE)

  if (bar == "SD") {
    stderr <- sd(x, na.rm = TRUE)    ## Standard deviation (SD)
    lower  <- centre - stderr
    upper  <- centre + stderr
  } else if (bar == "SE") {          ## Standard error (SE) of mean
    stderr <- sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))
    ## stderr <- sqrt(var(x, na.rm = TRUE) / length(x[complete.cases(x)]))
    lower <- centre - stderr
    upper <- centre + stderr
  } else if (bar == "CI") {          ## Confidence interval (CI), here 95%.
    conf  <- t.test(x)$conf.int
    lower <- conf[1]
    upper <- conf[2]
  } else {
    stop("'method' invalid")
  }

  res <- c(lower = lower, centre = centre, upper = upper)
  return(res)
}

## test it
vec.segment(dat[, 2])
summ.2 <- df.summ(dat, method = vec.segment, bar = "SE")

## ----------------------------------------------------------
#' iris data
df.summ(iris)

#' Group summary
## library(plyr)
## ddply(iris, .(Species), df.summ)
## (tmp <- dlply(iris, .(Species), df.summ, method = vec.segment))
## do.call("rbind", tmp)

#' or you can use summarise to get the group summary for single variable:
## ddply(iris, .(Species), summarise,
##       mean = mean(Sepal.Length), std = sd(Sepal.Length))
Use Borda count to get the final feature order.
feat.agg(fs.rank.list)
fs.rank.list: A data frame of feature orders by different feature selectors.
A list with components:

fs.order: Final feature order.
fs.rank: Aggregated rank list by Borda count.
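
For intuition, Borda count aggregation amounts to summing the per-method ranks of each feature and re-ranking the sums; a minimal sketch (the name borda is illustrative, not the package internals):

## Borda-count style aggregation of a data frame of per-method ranks
## (smaller rank = better)
borda <- function(fs.rank.list) {
  score <- rowSums(as.matrix(fs.rank.list))    # sum of ranks per feature
  list(fs.order = rownames(fs.rank.list)[order(score)],    # best to worst
       fs.rank  = rank(score))                             # aggregated ranks
}
## e.g. borda(data.frame(fs.anova = c(1, 2, 3), fs.rf = c(2, 1, 3),
##                       row.names = c("v1", "v2", "v3")))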
Wanchang Lin
data(abr1)
dat <- preproc(abr1$pos[, 200:400], method = "log10")
cls <- factor(abr1$fact$class)

## feature selection without resampling
fs <- feat.mfs(dat, cls, method = c("fs.anova", "fs.rf", "fs.rfe"),
               is.resam = FALSE)

## rank aggregation
fs.1 <- feat.agg(fs$fs.rank)
names(fs.1)
Frequency and stability of feature selection.
feat.freq(x, rank.cutoff = 50, freq.cutoff = 0.5)
x: A matrix or data frame of feature orders.
rank.cutoff: A numeric value for the cutoff of top features.
freq.cutoff: A numeric value for the cutoff of feature frequency.
A list with components:

freq.all: Feature frequencies.
freq: Feature frequencies larger than freq.cutoff.
stability: Stability rate of feature ranking.
rank.cutoff: Top feature order cut-off used.
freq.cutoff: Feature frequency cut-off used.
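
For intuition, a minimal sketch of top-k selection frequency together with one simple stability measure (the average pairwise overlap of the top-k lists); this illustrates the idea and is not necessarily the exact stability rate computed by feat.freq:

## 'orders' is a matrix with one ranked feature list per column
top.freq <- function(orders, k = 50) {
  top  <- orders[1:k, , drop = FALSE]             # top-k features per run
  freq <- sort(table(top) / ncol(top), decreasing = TRUE)
  stab <- mean(combn(ncol(top), 2, function(j)    # mean pairwise overlap
    length(intersect(top[, j[1]], top[, j[2]])) / k))
  list(freq = freq, stability = stab)
}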
Wanchang Lin
Davis, C. A., et al., (2006) Reliable gene signatures for microarray classification: assessment of stability and performance. Bioinformatics, vol.22, no.19, 2356 - 2363.
Michiels, S., et al., (2005) Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet, vol.365, 488 - 492.
## prepare data set
data(abr1)
cls <- factor(abr1$fact$class)
dat <- abr1$pos
## dat <- abr1$pos[, 110:1930]

## fill zeros with NAs
dat <- mv.zene(dat)

## missing values summary
mv <- mv.stats(dat, grp = cls)
mv    ## View the missing value pattern

## filter missing value variables
## dim(dat)
dat <- dat[, mv$mv.var < 0.15]
## dim(dat)

## fill NAs with mean
dat <- mv.fill(dat, method = "mean")

## log transformation
dat <- preproc(dat, method = "log10")

## select class "1" and "2" for feature ranking
ind <- grepl("1|2", cls)
mat <- dat[ind, , drop = FALSE]
mat <- as.matrix(mat)
grp <- cls[ind, drop = TRUE]

## use resampling method of bootstrap
pars <- valipars(sampling = "boot", niter = 10, nreps = 5)
z    <- feat.rank.re(mat, grp, method = "fs.plsvip", pars = pars)

## compute the frequency and stability of feature selection
freq <- feat.freq(z$order.list, rank.cutoff = 50, freq.cutoff = 0.5)
Multiple feature selection with or without resampling procedures.
feat.mfs(x, y, method, pars = valipars(), is.resam = TRUE, ...)

feat.mfs.stab(fs.res, rank.cutoff = 20, freq.cutoff = 0.5)

feat.mfs.stats(fs.stats, cumu.plot = FALSE, main = "Stats Plot",
               ylab = "Values", xlab = "Index of variable", ...)
x: A matrix or data frame containing the explanatory variables.
y: A factor specifying the class for each observation.
method: Multiple feature selection/ranking methods to be used.
pars: A list of resampling scheme. See valipars for details.
is.resam: A logical value indicating whether resampling should be applied.
fs.res: A list obtained by running feat.mfs.
rank.cutoff: Cutoff of top features for frequency calculation.
freq.cutoff: Cutoff of feature frequency.
fs.stats: A matrix of feature statistics or values output by feat.mfs.
cumu.plot: A logical value indicating whether the cumulative scores should be plotted.
main, xlab, ylab: Plot parameters.
...: Additional parameters.
feat.mfs.stab summarises multiple feature selection only when a resampling strategy is employed (i.e. is.resam is TRUE when calling feat.mfs). It obtains these results based on feat.mfs's returned value called all.

feat.mfs.stats handles the statistical values or scores. Its purpose is to provide guidance in selecting the best number of features by spotting the elbow point. This method should work in conjunction with plotting of p-values and their corresponding adjusted values, such as FDR and Bonferroni, in the multiple hypothesis test.
feat.mfs returns a list with components:

fs.order: A data frame of feature order, from best to worst.
fs.rank: A matrix of feature ranking scores.
fs.stats: A matrix of feature statistics or values.
all: A list of output of feat.rank.re for each feature selection method.

feat.mfs.stab returns a list with components:

fs.freq: Feature frequencies larger than freq.cutoff.
fs.subs: Features with frequencies larger than freq.cutoff.
fs.stab: Stability rate of feature ranking.
fs.cons: A matrix of feature consensus table based on feature frequency.

feat.mfs.stats returns a list with components:

stats.tab: Statistical values with their corresponding names.
stats.long: Long format of statistical values for plotting.
stats.p: An object of class "trellis".
The feature order can be computed directly from the overall statistics fs.stats. It is, however, slightly different from fs.order obtained by rank aggregation when resampling is employed.

fs.cons and fs.freq are computed based on fs.order.
Wanchang Lin
## Not run:
library(lattice)
data(abr1)
dat <- preproc(abr1$pos[, 200:400], method = "log10")
cls <- factor(abr1$fact$class)

tmp <- dat.sel(dat, cls, choices = c("1", "2"))
x   <- tmp[[1]]$dat
y   <- tmp[[1]]$cls

fs.method <- c("fs.anova", "fs.rf", "fs.rfe")
fs.pars   <- valipars(sampling = "cv", niter = 10, nreps = 5)
fs <- feat.mfs(x, y, fs.method, fs.pars)    ## with resampling
names(fs)

## frequency, consensus and stabilities of feature selection
fs.stab <- feat.mfs.stab(fs)
print(fs.stab$fs.cons, digits = 2, na.print = "")

## plot feature selection frequency
freq <- fs.stab$fs.freq
dotplot(freq$fs.anova, type = "o", main = "Feature Selection Frequencies")
barchart(freq$fs.anova)

## rank aggregation
fs.agg <- feat.agg(fs$fs.rank)

## stats table and plotting
fs.stats <- fs$fs.stats
tmp <- feat.mfs.stats(fs.stats, cumu.plot = TRUE)
tmp$stats.p
fs.tab <- tmp$stats.tab
## convert to matrix
fs.tab <- list2df(un.list(fs.tab))

## without resampling
fs.1 <- feat.mfs(x, y, method = fs.method, is.resam = FALSE)

## End(Not run)
Feature selection with resampling method.
feat.rank.re(x, y, method, pars = valipars(), tr.idx = NULL, ...)
x: A matrix or data frame containing the explanatory variables.
y: A factor specifying the class for each observation.
method: Feature selection method to be used. For each method used by this function, the output must be a list including two components, fs.rank (feature ranking scores) and fs.order (feature order from best to worst).
pars: A list of resampling scheme, such as Leave-one-out cross-validation, Cross-validation, Bootstrap and Randomised validation (holdout). See valipars for details.
tr.idx: User-defined index of training samples. Can be generated by trainind.
...: Additional parameters to method.
A list with components:

method: Feature selection method used.
fs.rank: A vector of final feature rankings.
fs.order: A vector of final feature order, from best to worst.
rank.list: Feature rank lists of all computations.
order.list: Feature order lists of all computations.
pars: Resampling parameters.
tr.idx: Index of training samples.
all: All results from re-sampling.
Wanchang Lin
valipars, feat.freq, frankvali
## prepare data set
data(abr1)
cls <- factor(abr1$fact$class)
dat <- abr1$pos
## dat <- abr1$pos[, 110:1930]

## fill zeros with NAs
dat <- mv.zene(dat)

## missing values summary
mv <- mv.stats(dat, grp = cls)
## mv    ## View the missing value pattern

## filter missing value variables
## dim(dat)
dat <- dat[, mv$mv.var < 0.15]
## dim(dat)

## fill NAs with mean
dat <- mv.fill(dat, method = "mean")

## log transformation
dat <- preproc(dat, method = "log10")

## select class "1" and "2" for feature ranking
ind <- which(cls == 1 | cls == 2)
x   <- dat[ind, , drop = FALSE]
y   <- cls[ind, drop = TRUE]

## feature selection
pars   <- valipars(sampling = "boot", niter = 2, nreps = 5)
tr.idx <- trainind(y, pars = pars)
z <- feat.rank.re(x, y, method = "fs.auc", pars = pars)
names(z)
Get feature ranking on the training data and validate selected feature subsets by estimating their classification error rate.
frank.err(dat.tr, cl.tr, dat.te, cl.te, cl.method = "svm",
          fs.method = "fs.auc", fs.order = NULL, fs.len = "power2", ...)
dat.tr: A data frame or matrix of training data. Feature ranking and the classification model are carried out on this data set.
cl.tr: A factor or vector of training classes.
dat.te: A data frame or matrix of test data. Error rates are calculated on this data set.
cl.te: A factor or vector of test classes.
cl.method: Classification method to be used. Any classification method can be employed if it has a corresponding predict method.
fs.method: Feature ranking method. If fs.order is not NULL, it is ignored.
fs.order: A vector of feature order. Default is NULL, in which case the feature ranking is computed by fs.method.
fs.len: The lengths of feature subsets used for validation. For details, see get.fs.len.
...: Additional parameters to fs.method or cl.method.
A list with components:

cl.method: Classification method used.
fs.len: The lengths of feature subsets used for validation.
error: Error rate for each feature length.
fs.method: Feature ranking method used.
fs.order: Feature order vector.
fs.rank: Feature ranking score vector.
Wanchang Lin
data(abr1) dat <- abr1$pos x <- preproc(dat[,110:500], method="log10") y <- factor(abr1$fact$class) dat <- dat.sel(x, y, choices=c("1","6")) x.1 <- dat[[1]]$dat y.1 <- dat[[1]]$cls idx <- sample(1:nrow(x.1), round((2/3)*nrow(x.1)), replace=FALSE) ## construct train and test data train.dat <- x.1[idx,] train.cl <- y.1[idx] test.dat <- x.1[-idx,] test.cl <- y.1[-idx] ## validate feature selection on some feature subsets res <- frank.err(train.dat, train.cl, test.dat, test.cl, cl.method="knn", fs.method="fs.auc", fs.len="power2") names(res) ## full feature order list res$fs.order ## validation on subsets of feature order res$error ## or first apply feature selection fs <- fs.auc(train.dat,train.cl) ## then apply error estimation for each selected feature subset res.1 <- frank.err(train.dat, train.cl, test.dat, test.cl, cl.method="knn", fs.order=fs$fs.order, fs.len="power2") res.1$error
Estimates error rate of feature ranking with resampling methods.
frankvali(dat, ...)

## Default S3 method:
frankvali(dat, cl, cl.method="svm", fs.method="fs.auc",
          fs.order=NULL, fs.len="power2",
          pars=valipars(), tr.idx=NULL, ...)

## S3 method for class 'formula':
frankvali(formula, data=NULL, ..., subset, na.action=na.omit)

fs.cl(dat, cl, fs.order=colnames(dat), fs.len=1:ncol(dat),
      cl.method="svm", pars=valipars(), all.fs=FALSE, ...)

fs.cl.1(dat, cl, fs.order=colnames(dat), cl.method="svm",
        pars=valipars(), agg_f=FALSE, ...)
formula |
A formula of the form |
data |
Data frame from which variables specified in |
dat |
A matrix or data frame containing the explanatory variables if no formula is given as the principal argument. |
cl |
A factor specifying the class for each observation if no formula principal argument is given. |
cl.method |
Classification method to be used. Any classification method can be employed if it has method |
fs.method |
Feature ranking method to be used. If |
fs.order |
A vector of ordered features. In |
fs.len |
Feature length used for validation. For details, see |
pars |
A list specifying the resampling scheme, such as Cross-validation, Stratified cross-validation, Leave-one-out cross-validation, Randomised validation (holdout), Bootstrap, .632 bootstrap or .632 plus bootstrap, and the control parameters for the calculation of accuracy. See |
tr.idx |
User defined index of training samples. Can be generated by |
all.fs |
A logical value indicating whether all features should be used for evaluation. |
agg_f |
A logical value indicating whether aggregated features should be used for evaluation. |
... |
Additional parameters to |
subset |
Optional vector, specifying a subset of observations to be used. |
na.action |
Function which indicates what should happen when the data
contains |
These functions validate selected feature subsets by classification and resampling. Any classification model can be used provided its argument format is model(formula, data, subset, ...) and the corresponding method predict.model(object, newdata, ...) returns either the predicted class labels only or a list with a component named class, as lda and pcalda do (see the sketch below).

The resampling method can be one of cv, scv, loocv, boot, 632b and 632pb.

The feature ranking method can be one of fs.rf, fs.auc, fs.welch, fs.anova, fs.bw, fs.snr, fs.kruskal, fs.relief and fs.rfe.
frankvali returns an object of class frankvali including the components:
fs.method |
Feature ranking method used. |
cl.method |
Classification method used. |
fs.len |
Feature lengths used. |
fs.rank |
Final feature ranking. It is obtained based on |
err.all |
Error rate for all computation. |
err.iter |
Error rate for each iteration. |
err.avg |
Average error rate for all iterations. |
sampling |
Sampling scheme used. |
niter |
Number of iterations. |
nboot |
Number of bootstrap replications if the sampling method is
one of |
nfold |
Fold of cross-validations if the sampling is |
nrand |
Number of replications if the sampling is |
fs.list |
Feature list of all computation if |
fs.cl and fs.cl.1 return a matrix with columns of acc (accuracy), auc (area under ROC curve) and mar (class margin).

fs.cl is a simplified version of frankvali. Both frankvali and fs.cl validate only aggregated features taken from the top of the ranking downwards, whereas fs.cl.1 can validate either individual or aggregated features, as pictured below.
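The difference between the two validation modes can be pictured with a toy ranking (illustrative values only):

ord <- c("f3", "f1", "f7")   # a feature order, best first

## aggregated (frankvali, fs.cl, or fs.cl.1 with agg_f=TRUE):
## nested subsets taken from the top of the ranking
lapply(seq_along(ord), function(i) ord[1:i])

## individual (fs.cl.1 with agg_f=FALSE): each feature on its own
as.list(ord)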
Wanchang Lin
feat.rank.re
, frank.err
, valipars
,
boxplot.frankvali
, get.fs.len
data(abr1)
dat <- abr1$pos
x <- preproc(dat[, 110:500], method="log10")
y <- factor(abr1$fact$class)

dat <- dat.sel(x, y, choices=c("1","2"))
x.1 <- dat[[1]]$dat
y.1 <- dat[[1]]$cls

len  <- c(1:20, seq(25,50,5), seq(60,90,10), seq(100,300,50))
pars <- valipars(sampling="boot", niter=2, nreps=4)
res  <- frankvali(x.1, y.1, cl.method="knn", fs.method="fs.auc",
                  fs.len=len, pars=pars)
res
summary(res)
boxplot(res)

## Not run:
## or apply feature selection with re-sampling procedure at first
fs <- feat.rank.re(x.1, y.1, method="fs.auc", pars=pars)

## then estimate error of feature selection.
res.1 <- frankvali(x.1, y.1, cl.method="knn", fs.order=fs$fs.order,
                   fs.len=len, pars=pars)
res.1

## use formula
data.bin <- data.frame(y.1, x.1)
pars <- valipars(sampling="cv", niter=2, nreps=4)
res.2 <- frankvali(y.1~., data=data.bin, fs.method="fs.rfe",
                   fs.len=len, cl.method="knn", pars=pars)
res.2

## examples of fs.cl and fs.cl.1
fs <- fs.rf(x.1, y.1)
res.3 <- fs.cl(x.1, y.1, fs.order=fs$fs.order, fs.len=len,
               cl.method="svm", pars=pars, all.fs=TRUE)

ord <- fs$fs.order[1:50]
## aggregated features
res.4 <- fs.cl.1(x.1, y.1, fs.order=ord, cl.method="svm",
                 pars=pars, agg_f=TRUE)
## individual feature
res.5 <- fs.cl.1(x.1, y.1, fs.order=ord, cl.method="svm",
                 pars=pars, agg_f=FALSE)

## End(Not run)
Feature selection using ANOVA.
fs.anova(x,y,...)
x |
A data frame or matrix of data set. |
y |
A factor or vector of class. |
... |
Arguments to pass to method. |
A list with components:
fs.rank |
A vector of feature ranking scores. |
fs.order |
A vector of feature order from best to worst. |
stats |
A vector of statistics. |
pval |
A vector of p values. |
Wanchang Lin
## prepare data set
data(abr1)
cls <- factor(abr1$fact$class)
dat <- abr1$pos
## dat <- abr1$pos[,110:1930]

## fill zeros with NAs
dat <- mv.zene(dat)

## missing values summary
mv <- mv.stats(dat, grp=cls)
mv   ## View the missing value pattern

## filter missing value variables
## dim(dat)
dat <- dat[, mv$mv.var < 0.15]
## dim(dat)

## fill NAs with mean
dat <- mv.fill(dat, method="mean")

## log transformation
dat <- preproc(dat, method="log10")

## select class "1" and "2" for feature ranking
ind <- grepl("1|2", cls)
mat <- dat[ind, , drop=FALSE]
mat <- as.matrix(mat)
grp <- cls[ind, drop=TRUE]

## apply ANOVA method for feature selection/ranking
res <- fs.anova(mat, grp)
names(res)
Feature selection using area under receiver operating curve (AUC).
fs.auc(x,y,...)
x |
A data frame or matrix of data set. |
y |
A factor or vector of class. |
... |
Arguments to pass (currently ignored). |
A list with components:
fs.rank |
A vector of feature ranking scores. |
fs.order |
A vector of feature order from best to worst. |
stats |
A vector of measurements. |
This function is for two-class problem only.
Wanchang Lin
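For intuition, a per-feature AUC can be computed from the rank-sum (Mann-Whitney) statistic. The sketch below is an illustrative stand-alone definition, not the package's internal code:

## AUC of one feature for a two-class factor y; the second factor
## level is taken as the 'positive' class.
auc.one <- function(x, y) {
  y  <- factor(y)
  r  <- rank(x)
  n1 <- sum(y == levels(y)[1])
  n2 <- sum(y == levels(y)[2])
  (sum(r[y == levels(y)[2]]) - n2 * (n2 + 1) / 2) / (n1 * n2)
}

## e.g. on two iris species:
auc.one(iris$Sepal.Length[1:100], droplevels(iris$Species[1:100]))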
## prepare data set
data(abr1)
cls <- factor(abr1$fact$class)
dat <- abr1$pos
## dat <- abr1$pos[,110:1930]

## fill zeros with NAs
dat <- mv.zene(dat)

## missing values summary
mv <- mv.stats(dat, grp=cls)
mv   ## View the missing value pattern

## filter missing value variables
## dim(dat)
dat <- dat[, mv$mv.var < 0.15]
## dim(dat)

## fill NAs with mean
dat <- mv.fill(dat, method="mean")

## log transformation
dat <- preproc(dat, method="log10")

## select class "1" and "2" for feature ranking
ind <- grepl("1|2", cls)
mat <- dat[ind, , drop=FALSE]
mat <- as.matrix(mat)
grp <- cls[ind, drop=TRUE]

## apply AUC method for feature selection/ranking
res <- fs.auc(mat, grp)
names(res)
Feature selection using ratio of between-group to within-group sums of squares (BW).
fs.bw(x,y,...)
x |
A data frame or matrix of data set. |
y |
A factor or vector of class. |
... |
Arguments to pass (currently ignored). |
A list with components:
fs.rank |
A vector of feature ranking scores. |
fs.order |
A vector of feature order from best to worst. |
stats |
A vector of measurements. |
Wanchang Lin
Dudoit, S., Fridlyand, J. and Speed, T.P. Comparison of discrimination methods for classification of tumours using gene expression data. Journal of the American Statistical Association. Vol.97, No.457, 77-87.
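The statistic of Dudoit et al. has a compact per-feature definition; the following is an illustrative version, not the package's internal code:

## BW ratio of one feature: between-group sum of squares over
## within-group sum of squares.
bw.one <- function(x, y) {
  y <- factor(y)
  m <- tapply(x, y, mean)      # class means, in level order
  n <- tabulate(y)             # class sizes
  sum(n * (m - mean(x))^2) / sum((x - m[as.integer(y)])^2)
}

bw.one(iris$Sepal.Length, iris$Species)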
## prepare data set
data(abr1)
cls <- factor(abr1$fact$class)
dat <- abr1$pos
## dat <- abr1$pos[,110:1930]

## fill zeros with NAs
dat <- mv.zene(dat)

## missing values summary
mv <- mv.stats(dat, grp=cls)
mv   ## View the missing value pattern

## filter missing value variables
## dim(dat)
dat <- dat[, mv$mv.var < 0.15]
## dim(dat)

## fill NAs with mean
dat <- mv.fill(dat, method="mean")

## log transformation
dat <- preproc(dat, method="log10")

## select class "1" and "2" for feature ranking
ind <- grepl("1|2", cls)
mat <- dat[ind, , drop=FALSE]
mat <- as.matrix(mat)
grp <- cls[ind, drop=TRUE]

## apply BW ratio method for feature selection/ranking
res <- fs.bw(mat, grp)
names(res)
Feature selection using Kruskal-Wallis test.
fs.kruskal(x,y,...)
x |
A data frame or matrix of data set. |
y |
A factor or vector of class. |
... |
Arguments to pass to method. |
A list with components:
fs.rank |
A vector of feature ranking scores. |
fs.order |
A vector of feature order from best to worst. |
stats |
A vector of statistics. |
pval |
A vector of p values. |
Wanchang Lin
## prepare data set
data(abr1)
cls <- factor(abr1$fact$class)
dat <- abr1$pos
## dat <- abr1$pos[,110:1930]

## fill zeros with NAs
dat <- mv.zene(dat)

## missing values summary
mv <- mv.stats(dat, grp=cls)
mv   ## View the missing value pattern

## filter missing value variables
## dim(dat)
dat <- dat[, mv$mv.var < 0.15]
## dim(dat)

## fill NAs with mean
dat <- mv.fill(dat, method="mean")

## log transformation
dat <- preproc(dat, method="log10")

## select class "1" and "2" for feature ranking
ind <- grepl("1|2", cls)
mat <- dat[ind, , drop=FALSE]
mat <- as.matrix(mat)
grp <- cls[ind, drop=TRUE]

## apply Kruskal-Wallis test method for feature selection/ranking
res <- fs.kruskal(mat, grp)
names(res)
Feature selection using PCA loadings.
fs.pca(x,thres=0.8, ...)
x |
A data frame or matrix of data set. |
thres |
The threshold of the cumulative percentage of PC's explained variances. |
... |
Additional arguments to |
Since the PCA loadings form a matrix with respect to the PCs, the Mahalanobis distance of the loadings is applied to select the features. (Other summaries, for example the sum of absolute values of the loadings, or their square root, could be used instead.)

Note that this feature selection method is unsupervised; a rough sketch of the idea is given below.
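A rough base-R sketch of the idea (illustrative only; the exact computation in fs.pca may differ):

## rank features by the Mahalanobis distance of their loadings over
## the PCs whose cumulative explained variance reaches 80% (cf. 'thres')
pcs <- prcomp(iris[, 1:4], scale. = TRUE)
k   <- which(cumsum(pcs$sdev^2) / sum(pcs$sdev^2) >= 0.8)[1]
ld  <- pcs$rotation[, 1:k, drop = FALSE]
d   <- mahalanobis(ld, colMeans(ld), cov(ld))
order(d, decreasing = TRUE)   # feature order, best first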
A list with components:
fs.rank |
A vector of feature ranking scores. |
fs.order |
A vector of feature order from best to worst. |
stats |
A vector of measurements. |
Wanchang Lin
## prepare data set
data(abr1)
cls <- factor(abr1$fact$class)
dat <- abr1$pos
## dat <- abr1$pos[,110:1930]

## fill zeros with NAs
dat <- mv.zene(dat)

## missing values summary
mv <- mv.stats(dat, grp=cls)
mv   ## View the missing value pattern

## filter missing value variables
## dim(dat)
dat <- dat[, mv$mv.var < 0.15]
## dim(dat)

## fill NAs with mean
dat <- mv.fill(dat, method="mean")

## log transformation
dat <- preproc(dat, method="log10")

## select class "1" and "2" for feature ranking
ind <- grepl("1|2", cls)
mat <- dat[ind, , drop=FALSE]
mat <- as.matrix(mat)
grp <- cls[ind, drop=TRUE]

## feature selection by PCA
res <- fs.pca(dat)
names(res)
Feature selection using coefficient of regression and VIP values of PLS.
fs.pls(x, y, pls="simpls", ncomp=10, ...)
fs.plsvip(x, y, ncomp=10, ...)
fs.plsvip.1(x, y, ncomp=10, ...)
fs.plsvip.2(x, y, ncomp=10, ...)
x |
A data frame or matrix of data set. |
y |
A factor or vector of class. |
pls |
A method for calculating PLS scores and loadings. The following methods are supported:
For details, see |
ncomp |
The number of components to be used. |
... |
Arguments passed to or from other methods. |
fs.pls ranks the features by the regression coefficients of PLS. Since the coefficients form a matrix, owing to the dummy multiple response variables constructed for the classification (category) problem, the Mahalanobis distance of the coefficients is applied to select the features. (Other summaries, for example the sum of absolute values of the coefficients, or their square root, could be used instead; see the sketch below.)

fs.plsvip and fs.plsvip.1 carry out feature selection based on the Mahalanobis distance and the absolute values of PLS's VIP, respectively.

fs.plsvip.2 is similar to fs.plsvip and fs.plsvip.1, but the category response is not treated as a dummy multiple response matrix.
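An illustrative sketch of coefficient-based ranking (it assumes the pls package; this is not the package's implementation, and the sum-of-absolute-values summary mentioned above is used for simplicity):

library(pls)

## dummy-code the class factor as a multiple response matrix
y.dummy <- model.matrix(~ Species - 1, data = iris)
fit <- plsr(y.dummy ~ as.matrix(iris[, 1:4]), ncomp = 2)

## p x q matrix of coefficients, one column per dummy response
B <- coef(fit)[, , 1]

## summarise each feature's row of coefficients and rank
d <- rowSums(abs(B))
order(d, decreasing = TRUE)   # feature order, best first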
A list with components:
fs.rank |
A vector of feature ranking scores. |
fs.order |
A vector of feature order from best to worst. |
stats |
A vector of measurements. |
Wanchang Lin
## prepare data set
data(abr1)
cls <- factor(abr1$fact$class)
dat <- abr1$pos
## dat <- abr1$pos[,110:1930]

## fill zeros with NAs
dat <- mv.zene(dat)

## missing values summary
mv <- mv.stats(dat, grp=cls)
mv   ## View the missing value pattern

## filter missing value variables
## dim(dat)
dat <- dat[, mv$mv.var < 0.15]
## dim(dat)

## fill NAs with mean
dat <- mv.fill(dat, method="mean")

## log transformation
dat <- preproc(dat, method="log10")

## select class "1" and "2" for feature ranking
ind <- grepl("1|2", cls)
mat <- dat[ind, , drop=FALSE]
mat <- as.matrix(mat)
grp <- cls[ind, drop=TRUE]

## apply PLS methods for feature selection
res.pls      <- fs.pls(mat, grp, ncomp=4)
res.plsvip   <- fs.plsvip(mat, grp, ncomp=4)
res.plsvip.1 <- fs.plsvip.1(mat, grp, ncomp=4)
res.plsvip.2 <- fs.plsvip.2(mat, grp, ncomp=4)

## check differences among these methods
fs.order <- data.frame(pls      = res.pls$fs.order,
                       plsvip   = res.plsvip$fs.order,
                       plsvip.1 = res.plsvip.1$fs.order,
                       plsvip.2 = res.plsvip.2$fs.order)
head(fs.order, 20)
Feature selection using RELIEF method.
fs.relief(x,y, m=NULL, k=10, ...)
x |
A data frame or matrix of data set. |
y |
A factor or vector of class. |
m |
Number of instances to sample without replacement. Default is |
k |
Number of nearest neighbours used to estimate feature relevance. |
... |
Arguments to pass to method (currently ignored). |
This function implements ReliefF, an extension of the Relief algorithm that handles multi-class problems. For each sampled instance it searches for the k nearest neighbours from the same class, called hits, and the k nearest neighbours from each of the other classes, called misses.
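A toy version of the weight update conveys the idea; this is a simplified illustration (equal weighting of the miss classes, numeric features only), not the function's actual code:

relieff.sketch <- function(x, y, m = 20, k = 5) {
  x <- scale(as.matrix(x))
  y <- factor(y)
  w <- numeric(ncol(x))
  for (i in sample(nrow(x), m)) {
    d <- sqrt(colSums((t(x) - x[i, ])^2))   # distance to every sample
    d[i] <- Inf                             # exclude the sample itself
    for (cl in levels(y)) {
      idx <- which(y == cl)
      nn  <- idx[order(d[idx])][seq_len(min(k, length(idx) - (cl == y[i])))]
      fd  <- colMeans(abs(sweep(x[nn, , drop = FALSE], 2, x[i, ])))
      ## hits pull a feature's weight down, misses push it up
      if (cl == y[i]) w <- w - fd / m
      else            w <- w + fd / (m * (nlevels(y) - 1))
    }
  }
  w   # larger weight = more relevant feature
}

relieff.sketch(iris[, 1:4], iris$Species)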
A list with components:
fs.rank |
A vector of feature ranking scores. |
fs.order |
A vector of feature order from best to worst. |
stats |
A vector of measurements. |
Wanchang Lin
Kira, K. and Rendel, L. (1992). The Feature Selection Problem: Traditional Methods and a new algorithm. Proc. Tenth National Conference on Artificial Intelligence, MIT Press, 129 - 134.
Kononenko, I., Simes, E., and Robnik-Sikonja, M. (1997). Overcoming the Myopia of Induction Learning Algorithms with RELIEFF. Applied Intelligence, Vol.7, 1, 39-55.
Kononenko, I. (1994) Estimating Attributes: Analysis and Extensions of RELIEF, European Conference on Machine Learning, Ed. Francesco Bergadano and Luc De Raedt, 171-182, Springer
Robnik-Sikonja, M. and Kononenko, I. (2003) Theoretical and Empirical Analysis of ReliefF and RReliefF, Machine Learning, 53, 23 - 69.
data(iris)
x <- subset(iris, select = -Species)
y <- iris$Species

fs <- fs.relief(x, y, m=20, k=10)
Feature selection using Random Forests (RF).
fs.rf(x, y, ...)
fs.rf.1(x, y, fs.len="power2", ...)
x |
A data frame or matrix of data set. |
y |
A factor or vector of class. |
fs.len |
Method or numeric sequence for feature lengths. For
details, see |
... |
Arguments to pass to |
fs.rf.1 selects features by successively eliminating the least important variables.
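A sketch of the successive-elimination idea (illustrative; it assumes the randomForest package, and the halving schedule merely stands in for the configurable fs.len):

library(randomForest)

## drop the least important half at each step, re-fitting each time;
## features dropped earlier end up at the bottom of the ranking
rf.elim <- function(x, y) {
  feats <- colnames(x)
  out   <- character(0)
  while (length(feats) > 1) {
    rf  <- randomForest(x[, feats, drop = FALSE], y, importance = TRUE)
    imp <- sort(importance(rf)[, "MeanDecreaseAccuracy"])
    drp <- names(imp)[seq_len(ceiling(length(feats) / 2))]
    out   <- c(rev(drp), out)     # later drops rank higher
    feats <- setdiff(feats, drp)
  }
  c(feats, out)   # feature order, best first
}

rf.elim(as.matrix(iris[, 1:4]), iris$Species)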
A list with components:
fs.rank |
A vector of feature ranking scores. |
fs.order |
A vector of feature order from best to worst. |
stats |
A vector of measurements. For |
Wanchang Lin
data(abr1)
cls <- factor(abr1$fact$class)
dat <- abr1$pos

## fill zeros with NAs
dat <- mv.zene(dat)

## missing values summary
mv <- mv.stats(dat, grp=cls)
mv   ## View the missing value pattern

## filter missing value variables
dat <- dat[, mv$mv.var < 0.15]

## fill NAs with mean
dat <- mv.fill(dat, method="mean")

## log transformation
dat <- preproc(dat, method="log10")

## select class "1" and "2" for feature ranking
ind <- grepl("1|2", cls)
mat <- dat[ind, , drop=FALSE]
mat <- as.matrix(mat)
grp <- cls[ind, drop=TRUE]

## apply random forests for feature selection/ranking
res   <- fs.rf(mat, grp)
res.1 <- fs.rf.1(mat, grp)

## compare the results
fs <- cbind(fs.rf=res$fs.order, fs.rf.1=res.1$fs.order)

## plot the importance score of 'fs.rf' (not 'fs.rf.1')
score <- res$stats
score <- sort(score, decreasing = TRUE)
plot(score)
Feature selection using Support Vector Machine based on Recursive Feature Elimination (SVM-RFE).
fs.rfe(x,y,fs.len="power2",...)
x |
A data frame or matrix of data set. |
y |
A factor or vector of class. |
fs.len |
Method for feature lengths used in SVM-RFE computation.
For details, see |
... |
Arguments to pass to |
A list with components:
fs.rank |
A vector of feature ranking scores. |
fs.order |
A vector of feature order from best to worst. |
Wanchang Lin
## prepare data set
data(abr1)
cls <- factor(abr1$fact$class)
dat <- abr1$pos
## dat <- abr1$pos[,110:1930]

## fill zeros with NAs
dat <- mv.zene(dat)

## missing values summary
mv <- mv.stats(dat, grp=cls)
mv   ## View the missing value pattern

## filter missing value variables
## dim(dat)
dat <- dat[, mv$mv.var < 0.15]
## dim(dat)

## fill NAs with mean
dat <- mv.fill(dat, method="mean")

## log transformation
dat <- preproc(dat, method="log10")

## select class "1" and "2" for feature ranking
ind <- grepl("1|2", cls)
mat <- dat[ind, , drop=FALSE]
mat <- as.matrix(mat)
grp <- cls[ind, drop=TRUE]

## apply RFE method for feature selection/ranking
res <- fs.rfe(mat, grp)
names(res)
Feature selection using signal-to-noise ratio (SNR).
fs.snr(x,y,...)
x |
A data frame or matrix of data set. |
y |
A factor or vector of class. |
... |
Arguments to pass (currently ignored). |
A list with components:
fs.rank |
A vector of feature ranking scores. |
fs.order |
A vector of feature order from best to worst. |
stats |
A vector of measurements. |
This function is for two-class problem only.
Wanchang Lin
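The two-class SNR statistic is commonly defined as the difference of the class means over the sum of the class standard deviations; an illustrative per-feature version (not the package's internal code):

snr.one <- function(x, y) {
  g <- split(x, y)   # the two class vectors
  abs(mean(g[[1]]) - mean(g[[2]])) / (sd(g[[1]]) + sd(g[[2]]))
}

## e.g. on two iris species:
snr.one(iris$Sepal.Width[1:100], droplevels(iris$Species[1:100]))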
## prepare data set
data(abr1)
cls <- factor(abr1$fact$class)
dat <- abr1$pos
## dat <- abr1$pos[,110:1930]

## fill zeros with NAs
dat <- mv.zene(dat)

## missing values summary
mv <- mv.stats(dat, grp=cls)
mv   ## View the missing value pattern

## filter missing value variables
## dim(dat)
dat <- dat[, mv$mv.var < 0.15]
## dim(dat)

## fill NAs with mean
dat <- mv.fill(dat, method="mean")

## log transformation
dat <- preproc(dat, method="log10")

## select class "1" and "2" for feature ranking
ind <- grepl("1|2", cls)
mat <- dat[ind, , drop=FALSE]
mat <- as.matrix(mat)
grp <- cls[ind, drop=TRUE]

## apply SNR method for feature selection/ranking
res <- fs.snr(mat, grp)
names(res)
Feature selection using Welch test.
fs.welch(x,y,...)
x |
A data frame or matrix of data set. |
y |
A factor or vector of class. |
... |
Arguments to pass to method. |
A list with components:
fs.rank |
A vector of feature ranking scores. |
fs.order |
A vector of feature order from best to worst. |
stats |
A vector of statistics. |
pval |
A vector of p values. |
This function is for two-class problem only.
Wanchang Lin
## prepare data set
data(abr1)
cls <- factor(abr1$fact$class)
dat <- abr1$pos
## dat <- abr1$pos[,110:1930]

## fill zeros with NAs
dat <- mv.zene(dat)

## missing values summary
mv <- mv.stats(dat, grp=cls)
mv   ## View the missing value pattern

## filter missing value variables
## dim(dat)
dat <- dat[, mv$mv.var < 0.15]
## dim(dat)

## fill NAs with mean
dat <- mv.fill(dat, method="mean")

## log transformation
dat <- preproc(dat, method="log10")

## select class "1" and "2" for feature ranking
ind <- grepl("1|2", cls)
mat <- dat[ind, , drop=FALSE]
mat <- as.matrix(mat)
grp <- cls[ind, drop=TRUE]

## apply Welch method for feature selection/ranking
res <- fs.welch(mat, grp)
names(res)
Feature selection using Wilcoxon test.
fs.wilcox(x,y,...)
x |
A data frame or matrix of data set. |
y |
A factor or vector of class. |
... |
Arguments to pass to method. |
A list with components:
fs.rank |
A vector of feature ranking scores. |
fs.order |
A vector of feature order from best to worst. |
stats |
A vector of statistics. |
pval |
A vector of p values. |
This function is for two-class problem only.
Wanchang Lin
## prepare data set
data(abr1)
cls <- factor(abr1$fact$class)
dat <- abr1$pos
## dat <- abr1$pos[,110:1930]

## fill zeros with NAs
dat <- mv.zene(dat)

## missing values summary
mv <- mv.stats(dat, grp=cls)
mv   ## View the missing value pattern

## filter missing value variables
## dim(dat)
dat <- dat[, mv$mv.var < 0.15]
## dim(dat)

## fill NAs with mean
dat <- mv.fill(dat, method="mean")

## log transformation
dat <- preproc(dat, method="log10")

## select class "1" and "2" for feature ranking
ind <- grepl("1|2", cls)
mat <- dat[ind, , drop=FALSE]
mat <- as.matrix(mat)
grp <- cls[ind, drop=TRUE]

## apply Wilcoxon test method for feature selection/ranking
res <- fs.wilcox(mat, grp)
names(res)
Get feature lengths for feature selection validation by classification.
get.fs.len(p,fs.len=c("power2"))
p |
Number of features in the data set. |
fs.len |
Method or numeric sequence for feature lengths. It can be a numeric vector as user-defined feature lengths, or methods:
|
The generated feature lengths are used in the validation of feature subsets by classification. Each feature length determines the size of a feature subset taken from the top of the full feature order list.

A descending-order numeric vector of feature lengths. The first feature length returned is always p.
Wanchang Lin
data(abr1)
dat <- abr1$pos

## number of features
p <- ncol(dat)

## predefined feature lengths. The returned vector is in descending
## order, with the first element being p.
(vec <- get.fs.len(p, fs.len=c(1,2,3,4,5,6,7,8,9,10)))

## use all features as feature lengths
(vec.full <- get.fs.len(p, fs.len="full"))

## use "half"
(vec.half <- get.fs.len(p, fs.len="half"))

## use "power2"
(vec.power2 <- get.fs.len(p, fs.len="power2"))
Plot matrix-like object by group
grpplot(x, y, plot = "pairs", ...)
x |
A matrix or data frame to be plotted. |
y |
A factor or vector giving group information of columns of |
plot |
One of plot types: |
... |
Further arguments. See corresponding entry in
|
An object of class "trellis"
.
Wanchang Lin
panel.elli.1
, pcaplot
,
pca_plot_wrap
, lda_plot_wrap
,
pls_plot_wrap
.
data(iris)
grpplot(iris[,1:4], iris[,5], plot="strip",   main="IRIS DATA")
grpplot(iris[,1:4], iris[,5], plot="box",     main="IRIS DATA")
grpplot(iris[,1:4], iris[,5], plot="density", main="IRIS DATA")
grpplot(iris[,1:4], iris[,5], plot="pairs",   main="IRIS DATA", ep=2)

## returns an object of class "trellis".
tmp <- grpplot(iris[,c(2,1)], iris[,5], main="IRIS DATA", ep=2)
tmp

## change symbol's color, type and size
grpplot(iris[,c(2,1)], iris[,5], main="IRIS DATA", cex=1.5,
        auto.key=list(space="right", col=c("black","blue","red")),
        par.settings=list(superpose.symbol=list(col=c("black","blue","red"),
                                                pch=c(1:3))))
Functions to handle manipulation of lists.
list2df(x)
un.list(x, y="")
shrink.list(x)
x |
A list to be manipulated. |
y |
A character or string of separator. |
list2df converts a list whose components are vectors into a data frame. Shorter vectors are filled with NA. It is useful for converting ragged vectors into a data frame that can be written to an Excel file.

un.list collapses a list of arbitrary depth into a one-depth list. The function works recursively, so it can handle any level of nesting.
shrink.list
removes all NULL or NA entries from a list.
list2df
returns a data frame.
un.list
returns a list.
shrink.list
returns a list.
Wanchang Lin
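A quick illustration of the first two functions (the component values here are arbitrary, and the exact shape of the returned objects may differ):

x <- list(a = 1:3, b = 1:5)
list2df(x)    # shorter component 'a' padded with NA

y <- list(u = list(v = 1, w = 2), z = 3)
un.list(y)    # nested components collapsed to one depth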
## See examples of function feat.mfs for the usages of list2df and un.list.
a <- list(x=1, y=NA, z=NULL)
b <- list(x=1, y=NA)
c <- list(x=1, z=NULL)

shrink.list(a)
shrink.list(b)
shrink.list(c)
Estimation of classification accuracy by multiple classifiers with a resampling procedure, and comparison of the classifiers.
maccest(dat, ...)

## Default S3 method:
maccest(dat, cl, method="svm", pars=valipars(),
        tr.idx=NULL, comp="anova", ...)

## S3 method for class 'formula':
maccest(formula, data=NULL, ..., subset, na.action=na.omit)
formula |
A formula of the form |
data |
Data frame from which variables specified in |
dat |
A matrix or data frame containing the explanatory variables if no formula is given as the principal argument. |
cl |
A factor specifying the class for each observation if no formula principal argument is given. |
method |
A vector of multiple classification methods to be used. Classifiers,
such as |
pars |
A list specifying the resampling scheme, such as Leave-one-out cross-validation, Cross-validation, Randomised validation (holdout) or Bootstrap, and the control parameters for the calculation of accuracy. See |
tr.idx |
User defined index of training samples. Can be generated by
|
comp |
Comparison method of multiple classifier. If |
... |
Additional parameters to |
subset |
Optional vector, specifying a subset of observations to be used. |
na.action |
Function which indicates what should happen when the data
contains |
The accuracy rates for classification are obtained using techniques such as Random Forest, Support Vector Machine, k-Nearest Neighbour Classification and Linear Discriminant Analysis, combined with resampling methods including Leave-one-out cross-validation, Cross-validation, Randomised validation (holdout) and Bootstrap.
An object of class maccest
, including the components:
method |
Classification method used. |
acc |
Accuracy rate. |
acc.iter |
Accuracy rate of each iteration. |
acc.std |
Standard deviation of accuracy rate. |
mar |
Prediction margin. |
mar.iter |
Prediction margin of each iteration. |
auc |
The area under receiver operating curve (AUC). |
auc.iter |
AUC of each iteration. |
comp |
Multiple comparison method used. |
h.test |
Hypothesis test results of multiple comparison. |
gl.pval |
Global or overall p-value. |
mc.pval |
Pairwise comparison p-values. |
sampling |
Sampling scheme used. |
niter |
Number of iterations. |
nreps |
Number of replications in each iteration. |
conf.mat |
Overall confusion matrix. |
acc.boot |
A list of bootstrap errors such as |
maccest can take any classification model if its argument format is model(formula, data, subset, na.action, ...) and the corresponding method predict.model(object, newdata, ...) returns either the predicted class labels only or a list with a component named class, such as lda and pcalda.
For the multiple comparisons by ANOVA, the following assumptions should be considered: the samples are randomly and independently selected; the populations are normally distributed; and the populations all have the same variance. All the comparisons are based on the results of all iterations.
aam.mcl is a simplified version which returns acc (accuracy), auc (area under ROC curve) and mar (class margin).
Wanchang Lin
accest
, aam.mcl
, valipars
,
plot.maccest
trainind
,
boxplot.maccest
,classifier
# Iris data
data(iris)
x <- subset(iris, select = -Species)
y <- iris$Species

method <- c("randomForest","svm","pcalda","knn")
pars   <- valipars(sampling="boot", niter=3, nreps=5, strat=TRUE)

res <- maccest(Species~., data=iris, method=method,
               pars=pars, comp="anova")
## or
res <- maccest(x, y, method=method, pars=pars, comp="anova")

res
summary(res)
plot(res)
boxplot(res)

oldpar <- par(mar=c(5,10,4,2) + 0.1)
plot(res$h.test$tukey, las=1)   ## plot the tukey results
par(oldpar)
Binary classification by multiple classifiers.
mbinest(dat, cl, choices = NULL, method, pars=valipars(),...)
dat |
A matrix or data frame containing the explanatory variables. |
cl |
A factor specifying the class for each observation. |
choices |
The vector or list of class labels to be chosen for binary classification.
For details, see |
method |
Multiple classification methods to be used. For details, see |
pars |
A list of parameters of the resampling method. See |
... |
Additional parameters to |
A list with components:
all |
All results of classification. |
com |
A matrix of the combinations of the binary class labels. |
acc |
A table of classification accuracy for the binary combination. |
mar |
Prediction margin. |
auc |
The area under receiver operating curve (AUC). |
method |
Classification methods used. |
niter |
Number of iterations. |
sampling |
Sampling scheme used. |
nreps |
Number of replications in each iteration if sampling is not |
Wanchang Lin
maccest
,valipars
, dat.sel
## iris data set
data(iris)
dat <- subset(iris, select = -Species)
cl  <- iris$Species

method <- c("svm","pcalda")
pars   <- valipars(sampling="cv", niter=10, nreps=5)
res    <- mbinest(dat, cl, choices=c("setosa"), method=method,
                  pars=pars, kernel="linear")

## combine prediction accuracy, AUC and margin
z <- round(cbind(res$acc, res$auc, res$mar), digits=3)
colnames(z) <- c(paste(method, ".acc", sep=""),
                 paste(method, ".auc", sep=""),
                 paste(method, ".mar", sep=""))
Performs multiple comparison by ANOVA and pairwise comparison by Tukey HSD test.
mc.anova(x, ...)
x |
A matrix or data frame to be tested. |
... |
Additional arguments pass to |
A list with components:
anova |
Hypothesis test results of |
tukey |
Hypothesis test results of |
gl.pval |
Global or overall p value returned by |
mc.pval |
Pairwise p value returned by |
Wanchang Lin
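The global and pairwise p values correspond to what base R's aov and TukeyHSD report on the accuracy matrix reshaped to long form; a rough equivalent on illustrative data:

## illustrative accuracy matrix: one column per classifier
acc <- matrix(runif(40, 0.7, 0.9), 10, 4,
              dimnames = list(NULL, c("rf", "svm", "lda", "knn")))
long <- data.frame(acc    = as.vector(acc),
                   method = factor(rep(colnames(acc), each = nrow(acc))))
fit <- aov(acc ~ method, data = long)
summary(fit)    # global p value
TukeyHSD(fit)   # pairwise p values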
# Iris data
data(iris)
x <- subset(iris, select = -Species)
y <- iris$Species

method <- c("randomForest","svm","pcalda","knn")
pars   <- valipars(sampling="boot", niter=10, nreps=4)
res    <- maccest(x, y, method=method, pars=pars, comp="anova")
res

htest <- mc.anova(res$acc.iter)

oldpar <- par(mar=c(5,10,4,2) + 0.1)
plot(htest$tukey, las=1)   ## plot the tukey results
par(oldpar)
Performs multiple comparison by Friedman test and pairwise comparison by Wilcoxon test.
mc.fried(x, p.adjust.method = p.adjust.methods,...)
x |
A matrix or data frame to be tested. |
p.adjust.method |
Method for adjusting p values (see |
... |
Additional arguments pass to |
A list with components:
fried |
Hypothesis test results of |
wilcox |
Hypothesis test results of |
gl.pval |
Global or overall p value returned by |
mc.pval |
Pairwise p value returned by |
Wanchang Lin
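Similarly, a rough base-R equivalent applies friedman.test to the accuracy matrix (rows as blocks) and paired Wilcoxon tests with p-value adjustment; the data here are illustrative:

acc <- matrix(runif(40, 0.7, 0.9), 10, 4,
              dimnames = list(NULL, c("rf", "svm", "lda", "knn")))
friedman.test(acc)   # rows are blocks (iterations), columns are methods

pairwise.wilcox.test(as.vector(acc),
                     rep(colnames(acc), each = nrow(acc)),
                     paired = TRUE, p.adjust.method = "holm")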
# Iris data
data(iris)
x <- subset(iris, select = -Species)
y <- iris$Species

method <- c("randomForest","svm","pcalda","knn")
pars   <- valipars(sampling="cv", niter=10, nreps=4)
res    <- maccest(x, y, method=method, pars=pars, comp="fried",
                  kernel="linear")
res

htest <- mc.fried(res$acc.iter)
Performs Shapiro-Wilk normality test using shapiro.test and plots the density function and boxplot.
mc.norm(x, ...)
x |
A matrix or data frame to be tested. |
... |
Additional arguments pass to |
Results of shapiro.test, plus a boxplot and histogram.
Wanchang Lin
data(iris)
x <- subset(iris, select = -Species)
y <- iris$Species

method <- c("randomForest","svm","pcalda","knn")
pars   <- valipars(sampling="boot", niter=10, nreps=10)
res    <- maccest(x, y, method=method, pars=pars, comp="anova")
res
res$acc.iter

mc.norm(res$acc.iter)
Plot metric MDS with categorical information.
mdsplot(x, y, method = "euclidean", dimen = c(1,2), ...)
x |
A matrix or data frame to be plotted. |
y |
A factor or vector giving group information of columns of |
method |
The distance measure to be used. This must be one of "euclidean", "maximum",
"manhattan", "canberra", "binary" or "minkowski". Any unambiguous substring
can be given. It is only for |
dimen |
A vector of indices of the dimensions to be plotted. Only two dimensions are allowed. |
... |
Further arguments to |
mdsplot
returns an object of class "trellis"
.
Wanchang Lin
grpplot
, panel.elli
, mds_plot_wrap
,
pcaplot
## examples of 'mdsplot'
data(iris)
x <- iris[,1:4]
y <- iris[,5]
mdsplot(x, y, dimen=c(1,2), ep=2)
mdsplot(x, y, dimen=c(2,1), ep=1)

tmp <- mdsplot(x, y, ep=2, conf.level=0.9)
tmp

## change symbol's color, type and size
mdsplot(x, y, main="IRIS DATA", cex=1.2,
        auto.key=list(space="right", col=c("black","blue","red"), cex=1.2),
        par.settings=list(superpose.symbol=list(col=c("black","blue","red"),
                                                pch=c(1:3))))
Functions to handle missing values in a data set.
mv.stats(dat, grp=NULL, ...)
mv.fill(dat, method="mean", ze_ne=FALSE)
mv.zene(dat)
dat |
A data frame or matrix of data set. |
grp |
A factor or vector of class. |
method |
Univariate imputation method for missing value. For details, see examples below. |
ze_ne |
A logical value indicating whether the zeros or negatives should be treated as missing values. |
... |
Additional parameters to |
mv.fill
returns an imputed data frame.
mv.zene
returns an NA-filled data frame.
mv.stats
returns a list including the components:
mv.overall
: Overall missing value rate.
mv.var
: Missing value rate per variable (column).
mv.grp
: A matrix of missing value rate for different groups
if argument grp
is given.
mv.grp.plot
: An object of class trellis
for plotting
of mv.grp
if argument grp
is given.
Wanchang Lin
data(abr1)
dat <- abr1$pos[, 1970:1980]
cls <- factor(abr1$fact$class)

## fill zeros with NAs
dat <- mv.zene(dat)

## missing values summary
mv <- mv.stats(dat, grp=cls)
plot(mv$mv.grp.plot)

## fill NAs with mean
dat.mean <- mv.fill(dat, method="mean")

## fill NAs with median
dat.median <- mv.fill(dat, method="median")

## -----------------------------------------------------------------------
## fill NAs with user-defined methods: two examples given here.
## a.) Random imputation function:
rand <- function(x, ...) sample(x[!is.na(x)], sum(is.na(x)), replace=TRUE)

## test this function:
(tmp <- dat[,1])   ## a vector with NAs
## get the randomised values for NAs
rand(tmp)

## fill NAs with method "rand"
dat.rand <- mv.fill(dat, method="rand")

## b.) "Low" imputation function:
"low" <- function(x, ...) {
  max(mean(x, ...) - 3 * sd(x, ...), min(x, ...) / 2)
}
## fill NAs with method "low"
dat.low <- mv.fill(dat, method="low")

## summary of imputed data set
df.summ(dat.mean)
Data pre-processing by orthogonal signal correction (OSC).
osc(x, ...)

## Default S3 method:
osc(x, y, method="wold", center=TRUE, osc.ncomp=4, pls.ncomp=10,
    tol=1e-3, iter=20, ...)

## S3 method for class 'formula':
osc(formula, data=NULL, ..., subset, na.action=na.omit)
formula |
A formula of the form |
data |
Data frame from which variables specified in |
x |
A matrix or data frame containing the explanatory variables if no formula is given as the principal argument. |
y |
A factor specifying the class for each observation if no formula principal argument is given. |
method |
A method for calculating OSC weights, loadings and scores. The following methods are supported:
|
center |
A logical value indicating whether the data set should be centred column-wise. |
osc.ncomp |
The number of components to be used in the OSC calculation. |
pls.ncomp |
The number of components to be used in the PLS calculation. |
tol |
A scalar value of tolerance for OSC computation. |
iter |
The number of iteration used in OSC calculation. |
... |
Arguments passed to or from other methods. |
subset |
An index vector specifying the cases to be used in the training sample. |
na.action |
A function to specify the action to be taken if |
An object of class osc
containing the following components:
x |
A matrix of OSC corrected data set. |
R2 |
R2 statistics. It is calculated as the fraction of variation in X after OSC correction for the calibration (training) data. |
angle |
An angle used for checking if scores |
w |
A matrix of OSC weights. |
p |
A matrix of OSC loadings. |
t |
A matrix of OSC scores. |
call |
The (matched) function call. |
center |
A logical value indicating whether the data set has been centred column-wise. |
osc.ncomp |
The number of component used in OSC computation. |
pls.ncomp |
The number of component used in PLS computation. |
method |
The OSC algorithm used. |
This function may be called giving either a formula and optional data frame, or a matrix and grouping factor as the first two arguments.
Wanchang Lin
Wold, S., Antti, H., Lindgren, F., Ohman, J.(1998). Orthogonal signal correction of near infrared spectra. Chemometrics Intell. Lab. Syst., 44: 175-185.
Westerhuis, J. A., de Jong, S., Smilde, A, K. (2001). Direct orthogonal signal correction. Chemometrics Intell. Lab. Syst., 56: 13-25.
Sjoblom. J., Svensson, O., Josefson, M., Kullberg, H., Wold, S. (1998). An evaluation of orthogonal signal correction applied to calibration transfer of near infrared spectra. Chemometrics Intell. Lab. Syst.,44: 229-244.
Svensson, O., Kourti, T. and MacGregor, J.F. (2002). An investigation of orthogonal correction algorithms and their characteristics. Journal of Chemometrics, 16:176-188.
Wise, B. M. and Gallagher, N.B. http://www.eigenvector.com/MATLAB/OSC.html.
predict.osc
, osc_wold
, osc_sjoblom
,
osc_wise
data(abr1)
cl  <- factor(abr1$fact$class)
dat <- abr1$pos

## divide data as training and test data
idx <- sample(1:nrow(dat), round((2/3)*nrow(dat)), replace=FALSE)

## construct train and test data
train.dat <- dat[idx,]
train.t   <- cl[idx]
test.dat  <- dat[-idx,]
test.t    <- cl[-idx]

## build OSC model based on the training data
res <- osc(train.dat, train.t, method="wise", osc.ncomp=2, pls.ncomp=4)
names(res)
res
summary(res)

## pre-process test data by OSC
test.dat.1 <- predict(res, test.dat)$x
Orthogonal signal correction (OSC) approach by Sjoblom et al.
osc_sjoblom(x, y, center=TRUE,osc.ncomp=4,pls.ncomp=10, tol=1e-3,iter=20,...)
x |
A numeric data frame or matrix to be pre-processed. |
y |
A vector or factor specifying the class for each observation. |
center |
A logical value indicating whether the data set should be centred column-wise. |
osc.ncomp |
The number of components to be used in the OSC calculation. |
pls.ncomp |
The number of components to be used in the PLS calculation. |
tol |
A scalar value of tolerance for OSC computation. |
iter |
The number of iteration used in OSC calculation. |
... |
Arguments passed to or from other methods. |
A list containing the following components:
x |
A matrix of OSC corrected data set. |
R2 |
R2 statistics. It is calculated as the fraction of variation in X after OSC correction. |
angle |
An angle used for checking if scores |
w |
A matrix of OSC weights. |
p |
A matrix of OSC loadings. |
t |
A matrix of OSC scores. |
center |
A logical value indicating whether the data set has been centred column-wise. |
Wanchang Lin
Sjoblom. J., Svensson, O., Josefson, M., Kullberg, H., Wold, S. (1998). An evaluation of orthogonal signal correction applied to calibration transfer of near infrared spectra. Chemometrics Intell. Lab. Syst.,44: 229-244.
Svensson, O., Kourti, T. and MacGregor, J.F. (2002). An investigation of orthogonal correction algorithms and their characteristics. Journal of Chemometrics, 16:176-188.
Westerhuis, J. A., de Jong, S., Smilde, A, K. (2001). Direct orthogonal signal correction. Chemometrics Intell. Lab. Syst., 56: 13-25.
osc
, predict.osc
, osc_wold
,
osc_wise
data(abr1)
cl  <- factor(abr1$fact$class)
dat <- abr1$pos

## divide data as training and test data
idx <- sample(1:nrow(dat), round((2/3)*nrow(dat)), replace=FALSE)

## construct train and test data
train.dat <- dat[idx,]
train.t   <- cl[idx]
test.dat  <- dat[-idx,]
test.t    <- cl[-idx]

## build OSC model based on the training data
res <- osc_sjoblom(train.dat, train.t)
names(res)

## pre-process test data by OSC
test.dat.1 <- predict.osc(res, test.dat)$x
Orthogonal signal correction (OSC) approach by Wise and Gallagher.
osc_wise(x, y, center=TRUE,osc.ncomp=4,pls.ncomp=10, tol=1e-3,iter=20,...)
x |
A numeric data frame or matrix to be pre-processed. |
y |
A vector or factor specifying the class for each observation. |
center |
A logical value indicating whether the data set should be centred column-wise. |
osc.ncomp |
The number of components to be used in the OSC calculation. |
pls.ncomp |
The number of components to be used in the PLS calculation. |
tol |
A scalar value of tolerance for OSC computation. |
iter |
The number of iterations used in the OSC calculation. |
... |
Arguments passed to or from other methods. |
A list containing the following components:
x |
A matrix of OSC corrected data set. |
R2 |
The R2 statistic, calculated as the fraction of variation in X after OSC correction. |
angle |
An angle used for checking whether the scores t are orthogonal to y. |
w |
A matrix of OSC weights. |
p |
A matrix of OSC loadings. |
t |
A matrix of OSC scores. |
center |
A logical value indicating whether the data set has been centred column-wise. |
Wanchang Lin
Westerhuis, J. A., de Jong, S. and Smilde, A. K. (2001). Direct orthogonal signal correction. Chemometrics Intell. Lab. Syst., 56: 13-25.
Wise, B. M. and Gallagher, N.B. http://www.eigenvector.com/MATLAB/OSC.html.
osc, predict.osc, osc_sjoblom, osc_wold
data(abr1) cl <- factor(abr1$fact$class) dat <- abr1$pos ## divide data as training and test data idx <- sample(1:nrow(dat), round((2/3)*nrow(dat)), replace=FALSE) ## construct train and test data train.dat <- dat[idx,] train.t <- cl[idx] test.dat <- dat[-idx,] test.t <- cl[-idx] ## build OSC model based on the training data res <- osc_wise(train.dat, train.t) names(res) ## pre-process test data by OSC test.dat.1 <- predict.osc(res,test.dat)$x
Orthogonal signal correction (OSC) approach by Wold et al.
osc_wold(x, y, center=TRUE,osc.ncomp=4,pls.ncomp=10, tol=1e-3,iter=20,...)
x |
A numeric data frame or matrix to be pre-processed. |
y |
A vector or factor specifying the class for each observation. |
center |
A logical value indicating whether the data set should be centred column-wise. |
osc.ncomp |
The number of components to be used in the OSC calculation. |
pls.ncomp |
The number of components to be used in the PLS calculation. |
tol |
A scalar value of tolerance for OSC computation. |
iter |
The number of iterations used in the OSC calculation. |
... |
Arguments passed to or from other methods. |
A list containing the following components:
x |
A matrix of OSC corrected data set. |
R2 |
The R2 statistic, calculated as the fraction of variation in X after OSC correction. |
angle |
An angle used for checking whether the scores t are orthogonal to y. |
w |
A matrix of OSC weights. |
p |
A matrix of OSC loadings. |
t |
A matrix of OSC scores. |
center |
A logical value indicating whether the data set has been centred column-wise. |
Wanchang Lin
Wold, S., Antti, H., Lindgren, F. and Ohman, J. (1998). Orthogonal signal correction of near-infrared spectra. Chemometrics Intell. Lab. Syst., 44: 175-185.
Svensson, O., Kourti, T. and MacGregor, J.F. (2002). An investigation of orthogonal correction algorithms and their characteristics. Journal of Chemometrics, 16:176-188.
Westerhuis, J. A., de Jong, S. and Smilde, A. K. (2001). Direct orthogonal signal correction. Chemometrics Intell. Lab. Syst., 56: 13-25.
osc, predict.osc, osc_sjoblom, osc_wise
data(abr1) cl <- factor(abr1$fact$class) dat <- abr1$pos ## divide data as training and test data idx <- sample(1:nrow(dat), round((2/3)*nrow(dat)), replace=FALSE) ## construct train and test data train.dat <- dat[idx,] train.t <- cl[idx] test.dat <- dat[-idx,] test.t <- cl[-idx] ## build OSC model based on the training data res <- osc_wold(train.dat, train.t) names(res) ## pre-process test data by OSC test.dat.1 <- predict.osc(res,test.dat)$x
Lattice panel functions for plotting grouped ellipses and outliers.
panel.elli(x, y, groups = NULL,conf.level = 0.975, ...) panel.elli.1(x, y, subscripts, groups=NULL, conf.level = 0.975, ep=0, com.grp=NULL, no.grp=NULL, ell.grp=NULL, ...) panel.outl(x, y, subscripts, groups=NULL, conf.level = 0.975, labs, ...)
x , y
|
Variables to be plotted. |
conf.level |
Confidence level for the ellipse. |
groups , subscripts
|
Internal parameters for Lattice. |
labs |
Labels for potential outliers. |
ep |
An integer controlling how the ellipse is plotted. |
com.grp |
A list of characters to select which combinations of groups are to be plotted. |
no.grp |
A list of characters to select which individual groups are not to be plotted. Note that it will be overridden by com.grp. |
ell.grp |
Another categorical vector used for plotting the ellipse. If provided, ep, com.grp and no.grp are ignored. |
... |
Further arguments. See the corresponding entry in xyplot. |
panel.elli is modified from the function panel.ellipse in package latticeExtra.
panel.elli.1 gives more control over how the ellipse is plotted for the current group. It also provides an option to plot ellipses based on another, user-defined grouping.
panel.outl plots the labels of data points outside the ellipse. These data points can be treated as potential outliers.
Returns objects of class "trellis".
panel.elli.1 can be called by the functions grpplot, pcaplot, mdsplot, pca_plot_wrap, mds_plot_wrap, pls_plot_wrap and lda_plot_wrap by passing the argument ep. See the examples of these functions for details.
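A common construction behind such confidence ellipses is a normal-theory contour whose squared radius is a chi-squared quantile at conf.level. The following is a minimal base-graphics sketch of that construction (illustrative only; the panel functions themselves draw within lattice):

data(iris)
x <- iris$Petal.Length
y <- iris$Sepal.Length
mu <- c(mean(x), mean(y))
S <- cov(cbind(x, y))
r <- sqrt(qchisq(0.975, df = 2))               # radius of the 97.5% contour
theta <- seq(0, 2 * pi, length.out = 200)
circ <- cbind(cos(theta), sin(theta))          # unit circle
ell <- sweep(circ %*% chol(S) * r, 2, mu, "+") # map the circle onto the ellipse
plot(x, y)
lines(ell)
## points outside the ellipse (potential outliers) have squared
## Mahalanobis distance beyond the chi-squared cutoff
out <- mahalanobis(cbind(x, y), mu, S) > r^2
text(x[out], y[out], labels = which(out), pos = 3, cex = 0.7)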
Wanchang Lin
library(lattice) data(iris) ## ===================================================================== ## Examples of calling 'panel.elli' and 'panel.outl' xyplot(Sepal.Length ~ Petal.Length, data = iris, groups=Species, par.settings = list(superpose.symbol = list(pch=c(15:17)), superpose.line = list(lwd=2, lty=1:3)), panel = function(x, y, ...) { panel.xyplot(x, y, ...) panel.elli(x, y, ..., type="l",lwd=2) panel.outl(x,y, ...) }, auto.key = list(x = .1, y = .8, corner = c(0, 0)), labs=rownames(iris), conf.level=0.9,adj = -0.5) ## Without groups xyplot(Sepal.Length ~ Petal.Length, data = iris, par.settings = list(plot.symbol = list(cex = 1.1, pch=16)), panel = function(x, y, ...) { panel.xyplot(x, y, ...) panel.elli(x, y, ..., type="l", lwd = 2) panel.outl(x,y, ...) }, auto.key = list(x = .1, y = .8, corner = c(0, 0)), labs=rownames(iris), conf.level=0.9,adj = -0.5) ## With conditioning xyplot(Sepal.Length ~ Petal.Length|Species, data = iris, par.settings = list(plot.symbol = list(cex = 1.1, pch=16)), layout=c(2,2), panel = function(x, y, ...) { panel.xyplot(x, y, ...) panel.elli(x, y, ..., type="l", lwd = 2) panel.outl(x,y, ...) }, auto.key = list(x = .6, y = .8, corner = c(0, 0)), adj = 0,labs=rownames(iris), conf.level=0.95) ## ===================================================================== ## Examples of 'panel.elli.1' xyplot(Sepal.Length ~ Petal.Length, data = iris, groups=Species, ## --------------------------------------------------- ## Select what to be plotted. ep=2, ## com.grp = list(a="setosa",b=c("versicolor", "virginica")), ## no.grp = "setosa", ## Not draw ellipse for "setosa" ## --------------------------------------------------- par.settings = list(superpose.symbol = list(pch=c(15:17)), superpose.line = list(lwd=2, lty=1:3)), panel = function(x, y, ...) { panel.xyplot(x, y, ...) panel.elli.1(x, y, ...) panel.outl(x,y, ...) }, auto.key = list(points = TRUE, rectangles = FALSE, space = "right"), adj = 0,labs=rownames(iris), conf.level=0.95) xyplot(Sepal.Length ~ Petal.Length, data = iris, groups=Species, ## --------------------------------------------------- ## Select what to be plotted. ep=2, ## com.grp = list(a="setosa",b=c("versicolor", "virginica")), no.grp = c("setosa","versicolor"),## Only draw "virginica" ## --------------------------------------------------- par.settings = list(superpose.symbol = list(pch=c(15:17)), superpose.line = list(lwd=2, lty=1:3)), panel = function(x, y, ...) { panel.xyplot(x, y, ...) panel.elli.1(x, y, ...) }, auto.key = list(x = .1, y = .8, corner = c(0, 0))) xyplot(Sepal.Length ~ Petal.Length, data = iris, groups=Species, ## --------------------------------------------------- ## Select what to be plotted. ep=2, com.grp = list(a="setosa",b=c("versicolor", "virginica")), ## no.grp = "setosa", ## Not draw ellipse for "setosa" ## --------------------------------------------------- par.settings = list(superpose.symbol = list(pch=c(15:17)), superpose.line = list(lwd=2, lty=1:3)), panel = function(x, y, ...) { panel.xyplot(x, y, ...) panel.elli.1(x, y, ...) }, auto.key = list(x = .1, y = .8, corner = c(0, 0))) xyplot(Sepal.Length ~ Petal.Length, data = iris, groups=Species, ep=1, par.settings = list(superpose.symbol = list(pch=c(15:17)), superpose.line = list(lwd=2, lty=1:3)), panel = function(x, y, ...) { panel.xyplot(x, y, ...) panel.elli.1(x, y, ...) 
}, auto.key = list(points = TRUE, rectangles = FALSE, space = "right")) ## ===================================================================== ## Another data set from package MASS require(MASS) data(Cars93) ## Plot ellipse based on original groups: DriveTrain xyplot(Price~EngineSize, data=Cars93, groups=DriveTrain, ep=2, par.settings = list(superpose.symbol = list(pch=c(15:17)), superpose.line = list(lwd=2, lty=1:3)), panel = function(x, y, ...) { panel.xyplot(x, y, ...) panel.elli.1(x, y, ...) }, auto.key = list(points = TRUE, rectangles = FALSE, space = "right")) ## But we want to plot ellipse using AirBags xyplot(Price~EngineSize, data=Cars93, groups=DriveTrain, ell.grp=Cars93$AirBags, par.settings = list(superpose.symbol = list(pch=c(15:17)), superpose.line = list(lwd=2, lty=1:3)), panel = function(x, y, ...) { panel.xyplot(x, y, ...) panel.elli.1(x, y, ...) }, auto.key = list(points = TRUE, rectangles = FALSE, space = "right"))
Lattice panel function for plotting a regression line in red.
panel.smooth.line(x, y,...)
x , y
|
Variables to be plotted. |
... |
Further arguments. See the corresponding entry in xyplot. |
An object of class "trellis".
Wanchang Lin
library(lattice) data(iris) splom(~iris[,1:4], varname.cex = 1.0, pscales = 0, panel = panel.smooth.line)
Outlier detection by the Mahalanobis distances of PC1 and PC2. Also plots PC1 and PC2 with their confidence ellipse.
pca.outlier(x, center = TRUE, scale=TRUE,conf.level = 0.975,...) pca.outlier.1(x, center = TRUE, scale=TRUE, conf.level = 0.975, group=NULL, main = "PCA", cex=0.7,...)
x |
A data frame or matrix. |
center |
A logical value indicating whether the variables should be shifted to be zero centred before PCA analysis takes place. |
scale |
A logical value indicating whether the variables should be scaled to have unit variance before PCA analysis takes place. |
conf.level |
The confidence level for controlling the cutoff of the Mahalanobis distances. |
group |
A character string or factor indicating the group information for the rows of x. |
main |
An overall title for PCA plot. |
cex |
A numerical value giving the amount by which plotting text and symbols should be magnified relative to the default. |
... |
Further arguments for plotting. |
A list with components:
plot |
Plot object of class "trellis". |
outlier |
Outliers detected. |
conf.level |
Confidence level used. |
mah.dist |
Mahalanobis distances of each data sample. |
cutoff |
Cutoff of Mahalanobis distances used for outlier detection. |
Examples of panel.elli and panel.outl give more general information about ellipses and outliers. If you ONLY want to plot outliers based on PCA in a general way, for example outliers in different groups or in conditional panels, you can write a wrapper function combining pca.comp, panel.elli and panel.outl. It is quite similar to the implementation of pca_plot_wrap.
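A minimal sketch of the detection rule itself, assuming (as is common) that the cutoff is the conf.level quantile of a chi-squared distribution with 2 degrees of freedom, one per PC; object names are illustrative:

data(iris)
pcs <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)$x[, 1:2]
md <- mahalanobis(pcs, colMeans(pcs), cov(pcs))  # squared distances
cutoff <- qchisq(0.975, df = 2)
which(md > cutoff)                               # candidate outliers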
Wanchang Lin
pcaplot, grpplot, panel.outl, panel.elli, pca_plot_wrap
data(iris) ## call lattice version pca.outlier(iris[,1:4], adj=-0.5) ## plot group pca.outlier(iris[,1:4], adj=-0.5,groups=iris[,5]) ## more information about groups pca.outlier(iris[,1:4],groups=iris[,5],adj = -0.5, xlim=c(-5, 5), auto.key = list(x = .05, y = .9, corner = c(0, 0)), par.settings = list(superpose.symbol=list(pch=rep(1:25)))) ## call basic graphic version pca.outlier.1(iris[,1:4]) ## plot group pca.outlier.1(iris[,1:4], group=iris[,5])
Classification with combination of principal component analysis (PCA) and linear discriminant analysis (LDA).
pcalda(x, ...) ## Default S3 method: pcalda(x, y, center = TRUE, scale. = FALSE, ncomp = NULL, tune=FALSE,...) ## S3 method for class 'formula' pcalda(formula, data = NULL, ..., subset, na.action = na.omit)
formula |
A formula of the form groups ~ x1 + x2 + .... |
data |
Data frame from which variables specified in formula are preferentially to be taken. |
x |
A matrix or data frame containing the explanatory variables if no formula is given as the principal argument. |
y |
A factor specifying the class for each observation if no formula principal argument is given. |
center |
A logical value indicating whether x should be centred before PCA. |
scale. |
A logical value indicating whether x should be scaled before PCA. |
ncomp |
The number of principal components to be used in the classification. If NULL, it is determined automatically (or tuned when tune = TRUE). |
tune |
A logical value indicating whether the best number of components should be tuned. |
... |
Arguments passed to or from other methods. |
subset |
An index vector specifying the cases to be used in the training sample. |
na.action |
A function to specify the action to be taken if NAs are found; the default action is na.omit, which rejects cases with missing values. |
A critical issue in applying linear discriminant analysis (LDA) is the singularity and instability of the within-class scatter matrix. In practice, there are often a large number of features available, but the total number of training patterns is limited and commonly less than the dimension of the feature space. To tackle this issue, pcalda combines PCA and LDA for classification: it uses PCA for dimension reduction, and the rotated data resulting from PCA is the input to LDA for classification. A minimal sketch of this idea is given below.
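An illustrative sketch of the same idea using prcomp() and MASS::lda() directly (not pcalda's exact internals):

library(MASS)
data(iris)
ncomp <- 3
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = FALSE)
z <- pca$x[, 1:ncomp, drop = FALSE]     # PCA-rotated data fed to LDA
fit <- lda(z, grouping = iris$Species)
pred <- predict(fit, z)$class
mean(pred == iris$Species)              # training accuracy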
An object of class pcalda containing the following components:
x |
The rotated data on discriminant variables. |
cl |
The observed class labels of training data. |
pred |
The predicted class labels of training data. |
posterior |
The posterior probabilities for the predicted classes. |
conf |
The confusion matrix based on training data. |
acc |
The accuracy rate of training data. |
ncomp |
The number of principal components used for classification. |
pca.out |
The output of PCA. |
lda.out |
The output of LDA. |
call |
The (matched) function call. |
This function may be called giving either a formula and optional data frame, or a matrix and grouping factor as the first two arguments.
Wanchang Lin
predict.pcalda, plot.pcalda, tune.func
data(abr1) cl <- factor(abr1$fact$class) dat <- abr1$pos ## divide data as training and test data idx <- sample(1:nrow(dat), round((2/3)*nrow(dat)), replace=FALSE) ## construct train and test data train.dat <- dat[idx,] train.t <- cl[idx] test.dat <- dat[-idx,] test.t <- cl[-idx] ## apply pcalda model <- pcalda(train.dat,train.t) model summary(model) ## plot plot(model,dimen=c(1,2),main = "Training data",abbrev = TRUE) plot(model,main = "Training data",abbrev = TRUE) ## confusion matrix pred.te <- predict(model, test.dat)$class table(test.t,pred.te)
Plot function for PCA with grouped values.
pcaplot(x, y, scale = TRUE, pcs = 1:2, ...) pca.plot(x, y, scale=TRUE, abbrev = FALSE, ep.plot=FALSE,...) pca.comp(x, scale=FALSE, pcs=1:2,...)
x |
A matrix or data frame to be plotted. |
y |
A factor or vector giving group information for the rows of x. |
scale |
A logical value indicating whether the data set x should be scaled to unit variance. |
pcs |
A vector of indices of PCs to be plotted. |
ep.plot |
A logical value indicating whether the ellipse should be plotted. |
abbrev |
A logical value indicating whether the group labels are abbreviated on the plots. |
... |
Further arguments passed to the plotting functions. |
pcaplot returns an object of class "trellis".
pca.comp returns a list with components:
scores |
PCA scores |
vars |
Proportion of variance |
varsn |
A vector of strings indicating the percentage of variance. |
The number of columns of x must be larger than 1. pcaplot uses lattice to plot PCA while pca.plot uses base graphics to do so. pca.plot plots PC1 and PC2 only.
Wanchang Lin
grpplot, panel.elli.1, pca_plot_wrap
## examples of 'pcaplot' data(iris) pcaplot(iris[,1:4], iris[,5],pcs=c(2,1),ep=2) ## change confidence interval (see 'panel.elli.1') pcaplot(iris[,1:4], iris[,5],pcs=c(1,2),ep=2, conf.level = 0.9) pcaplot(iris[,1:4], iris[,5],pcs=c(2,1),ep=1, auto.key=list(space="top", columns=3)) pcaplot(iris[,1:4], iris[,5],pcs=c(1,3,4)) tmp <- pcaplot(iris[,1:4], iris[,5],pcs=1:3,ep=2) tmp ## change symbol's color, type and size pcaplot(iris[,1:4], iris[,5],pcs=c(2,1),main="IRIS DATA", cex=1.2, auto.key=list(space="right", col=c("black","blue","red"), cex=1.2), par.settings = list(superpose.symbol = list(col=c("black","blue","red"), pch=c(1:3)))) ## compare pcaplot and pca.plot. pcaplot(iris[,1:4], iris[,5],pcs=c(1,2),ep=2) pca.plot(iris[,1:4], iris[,5], ep.plot = TRUE) ## an example of 'pca.comp' pca.comp(iris[,1:4], scale = TRUE, pcs=1:3)
Plot accuracy rate of each iteration.
## S3 method for class 'accest' plot(x, main = NULL, xlab = NULL, ylab = NULL, ...)
x |
An object of class accest. |
main |
An overall title for the plot. |
xlab |
A title for the x axis. |
ylab |
A title for the y axis. |
... |
Additional arguments to the plot. |
This function is a method for the generic function plot() for class accest. It plots the accuracy rate against the index of iterations.
Returns a plot for an object of class accest.
Wanchang Lin
# Iris data data(iris) # Stratified cross-validation of PCALDA for Iris data pars <- valipars(sampling="cv", niter=10, nreps=10, strat=TRUE) acc <- accest(Species~., data = iris, method = "pcalda", pars = pars) acc summary(acc) plot(acc)
Plot accuracy rate with standard deviation of each classifier.
## S3 method for class 'maccest' plot(x, main = NULL, xlab = NULL, ylab = NULL, ...)
x |
An object of class maccest. |
main |
An overall title for the plot. |
xlab |
A title for the x axis. |
ylab |
A title for the y axis. |
... |
Additional arguments to the plot. |
This function is a method for the generic function plot() for class maccest. It plots the accuracy rate with standard deviation against the classifiers.
Returns a plot for an object of class maccest.
Wanchang Lin
# Iris data data(iris) x <- subset(iris, select = -Species) y <- iris$Species method <- c("randomForest","svm","pcalda","knn") pars <- valipars(sampling="boot", niter = 10, nreps=4) res <- maccest(x, y, method=method, pars=pars, comp="anova",kernel="linear") res plot(res)
Plot linear discriminants of pcalda.
## S3 method for class 'pcalda' plot(x, dimen, ...)
x |
An object of class pcalda. |
dimen |
The index of linear discriminants to be used for the plot. |
... |
Further arguments. See the corresponding entry in xyplot. |
This function is a method for the generic function plot() for class pcalda. If the length of dimen is greater than 2, a pairs plot is used. If the length of dimen is equal to 2, a scatter plot is drawn. Otherwise, a dot plot is drawn for the single component.
An object of class "trellis".
Wanchang Lin
pcalda, predict.pcalda, lda_plot_wrap, panel.elli.1.
data(abr1) cl <- factor(abr1$fact$class) dat <- abr1$pos model <- pcalda(dat,cl) ## Second component versus first plot(model,dimen=c(1,2),main = "Training data",ep=2) ## Pairwise scatterplots of several components plot(model,main = "Training data",ep=1) ## The first component plot(model,dimen=c(1),main = "Training data")
Plot latent components of plsc or plslda.
## S3 method for class 'plsc' plot(x, dimen, ...) ## S3 method for class 'plslda' plot(x, dimen, ...)
x |
An object of class plsc or plslda. |
dimen |
The index of latent components to be used for the plot. |
... |
Further arguments. See the corresponding entry in xyplot. |
These functions are methods for the generic function plot() for classes plsc and plslda. If the length of dimen is greater than 2, a pairs plot is used. If the length of dimen is equal to 2, a scatter plot is drawn. Otherwise, a dot plot is drawn for the single component.
An object of class "trellis".
Wanchang Lin
plsc, predict.plsc, plslda, predict.plslda, pls_plot_wrap, panel.elli.1.
data(abr1) cl <- factor(abr1$fact$class) dat <- abr1$pos mod.plsc <- plsc(dat,cl,ncomp=4) mod.plslda <- plslda(dat,cl,ncomp=4) ## Second component versus first plot(mod.plsc,dimen=c(1,2),main = "Training data", ep = 2) plot(mod.plslda,dimen=c(1,2),main = "Training data", ep = 2) ## Pairwise scatterplots of several components plot(mod.plsc, main = "Training data", ep = 1) plot(mod.plslda, main = "Training data", ep = 1) ## single component plot(mod.plsc,dimen=c(1),main = "Training data") plot(mod.plslda,dimen=c(1),main = "Training data")
Classification with partial least squares (PLS) or PLS plus linear discriminant analysis (LDA).
plsc(x, ...) plslda(x, ...) ## Default S3 method: plsc(x, y, pls="simpls",ncomp=10, tune=FALSE,...) ## S3 method for class 'formula' plsc(formula, data = NULL, ..., subset, na.action = na.omit) ## Default S3 method: plslda(x, y, pls="simpls",ncomp=10, tune=FALSE,...) ## S3 method for class 'formula' plslda(formula, data = NULL, ..., subset, na.action = na.omit)
formula |
A formula of the form groups ~ x1 + x2 + .... |
data |
Data frame from which variables specified in formula are preferentially to be taken. |
x |
A matrix or data frame containing the explanatory variables if no formula is given as the principal argument. |
y |
A factor specifying the class for each observation if no formula principal argument is given. |
pls |
A method for calculating PLS scores and loadings. The following methods are supported: "simpls", "kernelpls" and "oscorespls". For details, see simpls.fit, kernelpls.fit and oscorespls.fit in package pls. |
ncomp |
The number of components to be used in the classification. |
tune |
A logical value indicating whether the best number of components should be tuned. |
... |
Arguments passed to or from other methods. |
subset |
An index vector specifying the cases to be used in the training sample. |
na.action |
A function to specify the action to be taken if NAs are found; the default action is na.omit, which rejects cases with missing values. |
plsc implements PLS for classification. In detail, the categorical response vector y is converted into a numeric matrix for regression by PLS, and the output of PLS is converted to posteriors by the softmax method. The classification results are obtained from these posteriors. plslda combines PLS and LDA for classification, in which PLS performs the dimension reduction and LDA classifies the data transformed by PLS.
Three PLS functions, simpls.fit, kernelpls.fit and oscorespls.fit, are implemented in package pls. A minimal sketch of the plsc scheme is shown below.
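An illustrative sketch of that scheme using package pls directly: dummy-code the classes, fit a multivariate PLS regression, and map the fitted values to posteriors with a softmax. This is a sketch of the idea, not necessarily plsc's exact internals.

library(pls)
data(iris)
Y <- model.matrix(~ Species - 1, data = iris)  # class indicator matrix
fit <- plsr(Y ~ as.matrix(iris[, 1:4]), ncomp = 3, method = "simpls")
scores <- fitted(fit)[, , 3]                   # fitted values at ncomp = 3
post <- exp(scores) / rowSums(exp(scores))     # softmax to posteriors
pred <- colnames(Y)[max.col(post)]             # class with largest posterior
table(iris$Species, pred)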
An object of class plsc or plslda containing the following components:
x |
A matrix of the latent components or scores from PLS. |
cl |
The observed class labels of training data. |
pred |
The predicted class labels of training data. |
conf |
The confusion matrix based on training data. |
acc |
The accuracy rate of training data. |
posterior |
The posterior probabilities for the predicted classes. |
ncomp |
The number of latent components used for classification. |
pls.method |
The PLS algorithm used. |
pls.out |
The output of PLS. |
lda.out |
The output of LDA, used only by plslda. |
call |
The (matched) function call. |
Both functions may be called giving either a formula and optional data frame, or a matrix and grouping factor as the first two arguments.
Wanchang Lin
Martens, H. and Nas, T. (1989) Multivariate calibration. John Wiley & Sons.
kernelpls.fit, simpls.fit, oscorespls.fit, predict.plsc, plot.plsc, tune.func
library(pls) data(abr1) cl <- factor(abr1$fact$class) dat <- preproc(abr1$pos , y=cl, method=c("log10"),add=1)[,110:500] ## divide data as training and test data idx <- sample(1:nrow(dat), round((2/3)*nrow(dat)), replace=FALSE) ## construct train and test data train.dat <- dat[idx,] train.t <- cl[idx] test.dat <- dat[-idx,] test.t <- cl[-idx] ## apply plsc and plslda (res <- plsc(train.dat,train.t, ncomp = 20, tune = FALSE)) ## Estimate the mean squared error of prediction (MSEP), root mean squared error ## of prediction (RMSEP) and R^2 (coefficient of multiple determination) for ## fitted PLSR model MSEP(res$pls.out) RMSEP(res$pls.out) R2(res$pls.out) (res.1 <- plslda(train.dat,train.t, ncomp = 20, tune = FALSE)) ## Estimate the mean squared error of prediction (MSEP), root mean squared error ## of prediction (RMSEP) and R^2 (coefficient of multiple determination) for ## fitted PLSR model MSEP(res.1$pls.out) RMSEP(res.1$pls.out) R2(res.1$pls.out) ## Not run: ## with function of tuning component numbers (z.plsc <- plsc(train.dat,train.t, ncomp = 20, tune = TRUE)) (z.plslda <- plslda(train.dat,train.t, ncomp = 20, tune = TRUE)) ## check ncomp tuning results z.plsc$ncomp plot(z.plsc$acc.tune) z.plslda$ncomp plot(z.plslda$acc.tune) ## plot plot(z.plsc,dimen=c(1,2,3),main = "Training data",abbrev = TRUE) plot(z.plslda,dimen=c(1,2,3),main = "Training data",abbrev = TRUE) ## predict test data pred.plsc <- predict(z.plsc, test.dat)$class pred.plslda <- predict(z.plslda, test.dat)$class ## classification rate and confusion matrix cl.rate(test.t, pred.plsc) cl.rate(test.t, pred.plslda) ## End(Not run)
Pre-processing of new data by osc.
## S3 method for class 'osc' predict(object, newdata,...)
object |
Object of class osc. |
newdata |
A matrix or data frame of cases to be corrected by OSC. |
... |
Arguments passed from or to other methods. |
This function is a method for the generic function predict() for class osc. If newdata is omitted, the corrected data set used in the osc model will be returned. A rough sketch of correcting new data is shown below.
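A rough, hypothetical sketch of what correcting new data involves, assuming the weights w and loadings p are stored column-wise and ignoring the centring step for brevity; see the package source for the exact computation:

data(abr1)
cl <- factor(abr1$fact$class)
dat <- abr1$pos[, 110:300]
idx <- sample(1:nrow(dat), round((2/3)*nrow(dat)), replace=FALSE)
res <- osc_sjoblom(dat[idx, ], cl[idx])
newdata <- as.matrix(dat[-idx, ])
t.new <- newdata %*% res$w              # orthogonal scores of the new data
x.cor <- newdata - t.new %*% t(res$p)   # deflate the OSC components
sum(x.cor^2) / sum(newdata^2)           # fraction of variation remaining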
A list containing the following components:
x |
A matrix of OSC corrected data set. |
Q2 |
The fraction of variation in X after OSC correction for the new data. |
Wanchang Lin
osc, osc_wold, osc_sjoblom, osc_wise
data(abr1) cl <- factor(abr1$fact$class) dat <- abr1$pos ## divide data as training and test data idx <- sample(1:nrow(dat), round((2/3)*nrow(dat)), replace=FALSE) ## construct train and test data train.dat <- dat[idx,] train.t <- cl[idx] test.dat <- dat[-idx,] test.t <- cl[-idx] ## build OSC model based on the training data res <- osc(train.dat, train.t, method="wold",osc.ncomp=2, pls.ncomp=4) names(res) res summary(res) ## pre-process test data by OSC test <- predict(res,test.dat) test.dat.1 <- test$x
Prediction of test data using pcalda.
## S3 method for class 'pcalda' predict(object, newdata,...)
object |
Object of class pcalda. |
newdata |
A matrix or data frame of cases to be classified. |
... |
Arguments passed from or to other methods. |
This function is a method for the generic function predict() for class pcalda. If newdata is omitted, the results for the training data in the pcalda object will be returned.
A list with components:
class |
The predicted class (a factor). |
posterior |
The posterior probabilities for the predicted classes. |
x |
The rotated test data by the projection matrix of LDA. |
Wanchang Lin
data(iris3) tr <- sample(1:50, 25) train <- rbind(iris3[tr,,1], iris3[tr,,2], iris3[tr,,3]) test <- rbind(iris3[-tr,,1], iris3[-tr,,2], iris3[-tr,,3]) cl <- factor(c(rep("s",25), rep("c",25), rep("v",25))) z <- pcalda(train, cl) pred <- predict(z, test) ## plot the projected data. grpplot(pred$x, pred$class, main="PCALDA: Iris")
Prediction of test data using plsc or plslda.
## S3 method for class 'plsc' predict(object, newdata,...) ## S3 method for class 'plslda' predict(object, newdata,...)
object |
Object of class plsc or plslda. |
newdata |
A matrix or data frame of cases to be classified. |
... |
Arguments passed from or to other methods. |
These functions are methods for the generic function predict() for classes plsc and plslda. If newdata is omitted, the results for the training data in the plsc or plslda object will be returned.
A list with components:
class |
The predicted class (a factor). |
posterior |
The posterior probabilities for the predicted classes. |
x |
The rotated test data by the projection matrix of PLS. |
Wanchang Lin
plsc
, plot.plsc
,plslda
, plot.plslda
data(iris3) tr <- sample(1:50, 25) train <- rbind(iris3[tr,,1], iris3[tr,,2], iris3[tr,,3]) test <- rbind(iris3[-tr,,1], iris3[-tr,,2], iris3[-tr,,3]) cl <- factor(c(rep("s",25), rep("c",25), rep("v",25))) ## model fit using plsc and plslda without tuning of ncomp (z.plsc <- plsc(train, cl)) (z.plslda <- plslda(train, cl)) ## predict for test data pred.plsc <- predict(z.plsc, test) pred.plslda <- predict(z.plslda, test) ## plot the projected test data. grpplot(pred.plsc$x, pred.plsc$class, main="PLSC: Iris") grpplot(pred.plslda$x, pred.plslda$class, main="PLSLDA: Iris")
Pre-process a data frame or matrix by different methods.
preproc(x, y=NULL, method="log", add=1) preproc.sd(x, y=NULL, na.rm = FALSE) preproc.const(x, y, tol = 1.0e-4)
x |
A numeric data frame or matrix to be pre-processed. |
y |
A factor specifying the group. It is only used by the method "TICnorm". |
method |
A method used to pre-process the data set, or a vector of methods applied sequentially. Supported methods include "log", "log10", "sqrt", "asinh" and "TICnorm" (see the examples). |
na.rm |
A logical value indicating whether NA values should be stripped before the computation proceeds. |
add |
A numeric value added to the data before the logarithmic transformation. |
tol |
A tolerance to decide if a matrix is singular; it will reject variables and linear combinations of unit-variance variables whose variance is less than tol^2. |
preproc transforms the data set by the provided method. preproc.sd removes variables which have (near) zero standard deviation, with or without respect to class/group information. preproc.const removes variables that appear to be constant within groups/classes. A minimal sketch of two of the transformations is given below.
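A minimal, illustrative sketch of two transformations, assuming "TICnorm" divides each observation (row) by its total ion count and "log" adds an offset before taking logs; see preproc itself for the authoritative definitions:

data(abr1)
dat <- abr1$pos[1:5, 1:6]
tic <- sweep(dat, 1, rowSums(dat), "/")   # row-wise TIC normalisation
lg  <- log(dat + 1)                       # log transform with offset add = 1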
A pre-processed data set.
Wanchang Lin
Berg, R., Hoefsloot, H., Westerhuis, J., Smilde, A. and Werf, M. (2006), Centering, scaling, and transformations: improving the biological information content of metabolomics data, BMC Genomics, 7:142
data(abr1) cl <- factor(abr1$fact$class) dat <- abr1$pos ## normalise data set using "TICnorm" z.1 <- preproc(dat, y=cl, method="TICnorm") ## scale data set using "log10" z.2 <- preproc(dat,method="log10", add=1) ## or scale data set using "log" and "TICnorm" sequentially z.3 <- preproc(dat,method=c("log","TICnorm"), add=0.1)
Functions to handle p-values of a data set.
pval.test(x,y, method="oneway.test",...) pval.reject(adjp,alpha)
x |
A data frame or matrix of data set. |
y |
A factor or vector of class. |
method |
Hypothesis test, such as oneway.test or t.test. |
adjp |
A matrix-like object of p-values from simultaneous testing. |
alpha |
A vector of p-value cutoffs (Type I error rates). |
... |
Arguments to pass. |
pval.test returns a vector of p-values. pval.reject returns a matrix giving the number of rejections at each cutoff; a minimal sketch of this counting is shown below.
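A minimal sketch of the counting pval.reject performs, on a small, made-up matrix of adjusted p-values (names are illustrative):

adjp <- cbind(raw = c(0.001, 0.02, 0.2), fdr = c(0.003, 0.06, 0.4))
alpha <- c(0.01, 0.05)
t(sapply(alpha, function(a) colSums(adjp < a)))  # rejections per cutoff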
Wanchang Lin
library(lattice) ## Example for pval.test and pval.reject ## prepare data set data(abr1) cls <- factor(abr1$fact$class) dat <- abr1$pos[,200:500] dat <- preproc(dat, method="log") ## select class "1" and "2" for feature ranking ind <- grepl("1|2", cls) dat <- dat[ind,,drop=FALSE] cls <- cls[ind, drop=TRUE] ## univariate p-values and its adjusted p-values pval <- sort(pval.test(dat, cls, method="t.test")) ## adjust p-values pval.ad <- sapply(c("fdr","bonferroni","BY"), function(y){ p.adjust(pval, method=y) }) pval.ad <- cbind(raw=pval, pval.ad) pval.reject(pval.ad,c(0.005, 0.01, 0.05)) ## plot the all p-values tmp <- cbind(pval.ad, idx=1:nrow(pval.ad)) tmp <- data.frame(tmp) # pval_long <- melt(tmp, id="idx") pval_long <- data.frame(idx = tmp$idx, stack(tmp, select = -idx)) pval_long <- pval_long[c("idx", "ind", "values")] names(pval_long) <- c("idx", "variable", "value") pval.p <- xyplot(value~idx, data=pval_long, groups=variable, par.settings = list(superpose.line = list(lty=c(1:7))), as.table = TRUE, type="l", par.strip.text = list(cex=0.65), ylim=c(-0.005, 1.0), ylab="P-values", xlab="Index of variables", main="p-values", auto.key = list(lines=TRUE, points = FALSE,space="right"), panel = function(x, y,...) { panel.xyplot(x, y, ...) panel.abline(h = 0.05, col = "red",lty =2) }) pval.p
Save a list of data frames or matrices into a CSV file.
save.tab(x, filename="temp.csv", firstline="\n")
x |
A list of data frames or matrices. |
filename |
A character string for the saved file name. |
firstline |
A string giving some description of the saved file. |
This function gives a quick option to save a set of data frames or matrices into a single table file.
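A minimal sketch of the underlying idea, assuming each element is written in turn to the same file with append = TRUE (illustrative only; save.tab's exact formatting may differ):

x <- list(a = head(iris, 2), b = head(cars, 2))
f <- tempfile(fileext = ".csv")
cat("Results\n", file = f)
for (nm in names(x)) {
  cat("\n", nm, "\n", sep = "", file = f, append = TRUE)
  suppressWarnings(write.table(x[[nm]], file = f, append = TRUE,
                               sep = ",", row.names = FALSE))
}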
No return value, called for side effects.
Wanchang Lin
## Not run: data(abr1) dat <- preproc(abr1$pos[,200:400], method="log10") cls <- factor(abr1$fact$class) tmp <- dat.sel(dat, cls, choices=c("1","2")) x <- tmp[[1]]$dat y <- tmp[[1]]$cls fs.method <- c("fs.anova","fs.rf","fs.rfe") fs.pars <- valipars(sampling="cv",niter=10,nreps=5) fs <- feat.mfs(x, y, fs.method, fs.pars) ## with resampling names(fs) fs <- fs[1:3] ## save consistency of feature selection filename <- "fs.csv" firstline <- paste('\nResults of feature selection', sep='') save.tab(fs, filename, firstline) ## End(Not run)
Functions to summarise two-group data.
stats.vec(x,y, method="mean",test.method = "wilcox.test",fc=TRUE,...) stats.mat(x,y, method="mean",test.method = "wilcox.test", padj.method= "fdr",fc=TRUE,...)
x |
A data frame or matrix of the data set for stats.mat, or a numeric vector for stats.vec. |
y |
A factor or vector of class. Two classes only. |
method |
Method for the group centre, such as "mean" or "median". |
test.method |
Method for computing p-values from a hypothesis test, such as t.test or wilcox.test. |
padj.method |
Method for p-value correction; can be one of the methods of p.adjust. |
fc |
A flag for fold-change. |
... |
Additional parameters. |
stats.vec returns a vector of the centre, group centres, fold change, log2 fold change, AUC and p-value. stats.mat, which is a wrapper of stats.vec, returns a data frame of the centre, group centres, fold change, log2 fold change, AUC, p-values and adjusted p-values.
If x has negative values, the results of fold change and log2 fold change are not reliable.
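A minimal sketch of how such statistics can be computed for one variable and two groups (group medians, fold change, log2 fold change, a Wilcoxon p-value, and the AUC recovered from the Mann-Whitney U statistic); illustrative only, as the exact output layout of stats.vec may differ:

data(iris)
idx <- iris$Species %in% c("setosa", "versicolor")
v <- iris$Sepal.Length[idx]
g <- factor(iris$Species[idx])
m <- tapply(v, g, median)                 # group centres
fc <- unname(m[1] / m[2])                 # fold change
p <- wilcox.test(v ~ g)$p.value
u <- unname(wilcox.test(v ~ g)$statistic) # Mann-Whitney U
auc <- u / prod(table(g))                 # AUC = U / (n1 * n2)
c(fc = fc, log2fc = log2(fc), auc = auc, p = p)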
Wanchang Lin
data(iris) sel <- c("setosa", "versicolor") grp <- iris[,5] idx <- match(iris[,5],sel,nomatch = 0) > 0 cls <- factor(iris[idx,5]) dat <- iris[idx,1:4] ## stats for the single vector stats.vec(dat[[1]],cls, method = "median",test.method = "wilcox.test") ## use matrix format tab <- stats.mat(dat,cls, method = "median",test.method = "wilcox.test", padj.method = "BH") sapply(tab, class)
Generate the index of training samples. The sampling schemes include leave-one-out cross-validation (loocv), cross-validation (cv), randomised validation (random) and bootstrap (boot).
trainind(cl, pars = valipars())
cl |
A factor or vector of class. |
pars |
A list of sampling parameters for generating the training index. It has the same structure as the output of valipars. |
Returns a list of training indices. A minimal sketch of one stratified scheme is shown below.
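A minimal sketch of what one stratified k-fold repetition involves (assign fold labels within each class, then take the training index for each fold); trainind itself handles all the schemes above:

set.seed(2)
data(iris)
cl <- iris$Species
k <- 5
fold <- integer(length(cl))
for (g in levels(cl)) {               # stratify: label folds per class
  i <- which(cl == g)
  fold[i] <- sample(rep(1:k, length.out = length(i)))
}
train.idx <- lapply(1:k, function(f) which(fold != f))
sapply(train.idx, function(i) table(cl[i]))  # class balance preserved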
Wanchang Lin
## A trivial example x <- as.factor(sample(c("a","b"), 20, replace=TRUE)) table(x) pars <- valipars(sampling="rand", niter=2, nreps=4, strat=TRUE,div=2/3) (temp <- trainind(x,pars=pars)) (tmp <- temp[[1]]) x[tmp[[1]]];table(x[tmp[[1]]]) ## train idx x[tmp[[2]]];table(x[tmp[[2]]]) x[tmp[[3]]];table(x[tmp[[3]]]) x[tmp[[4]]];table(x[tmp[[4]]]) x[-tmp[[1]]];table(x[-tmp[[1]]]) ## test idx x[-tmp[[2]]];table(x[-tmp[[2]]]) x[-tmp[[3]]];table(x[-tmp[[3]]]) x[-tmp[[4]]];table(x[-tmp[[4]]]) # iris data set data(iris) dat <- subset(iris, select = -Species) cl <- iris$Species ## generate 5-fold cross-validation samples cv.idx <- trainind(cl, pars = valipars(sampling="cv", niter=2, nreps=5)) ## generate leave-one-out cross-validation samples loocv.idx <- trainind(cl, pars = valipars(sampling = "loocv")) ## generate bootstrap samples with 25 replications boot.idx <- trainind(cl, pars = valipars(sampling = "boot", niter=2, nreps=25)) ## generate randomised samples with 1/4 division and 10 replications. rand.idx <- trainind(cl, pars = valipars(sampling = "rand", niter=2, nreps=10, div = 1/4))
Tune the appropriate number of components (ncomp) for plsc, plslda or pcalda.
tune.plsc(x,y, pls="simpls",ncomp=10, tune.pars,...) tune.plslda(x,y, pls="simpls",ncomp=10, tune.pars,...) tune.pcalda(x,y, ncomp=NULL, tune.pars,...)
x |
A matrix or data frame containing the explanatory variables if no formula is given as the principal argument. |
y |
A factor specifying the class for each observation if no formula principal argument is given. |
pls |
A method for calculating PLS scores and loadings. The following methods are supported: "simpls", "kernelpls" and "oscorespls". For details, see simpls.fit, kernelpls.fit and oscorespls.fit in package pls. |
ncomp |
The number of components to be used in the classification. |
tune.pars |
A list of parameters used by the resampling method. See valipars for details. |
... |
Further parameters passed to or from other methods. |
A list including:
ncomp |
The best number of components. |
acc.tune |
Accuracy rate of components. |
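A hedged sketch of what tuning does conceptually: estimate the resampled accuracy for each candidate ncomp and keep the best. This assumes accest passes ncomp through to the classifier and stores the overall accuracy in its acc component:

data(iris)
pars <- valipars(sampling = "cv", niter = 1, nreps = 5)
accs <- sapply(2:4, function(k)
  accest(iris[, 1:4], iris[, 5], method = "pcalda",
         pars = pars, ncomp = k)$acc)
(2:4)[which.max(accs)]   # best number of components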
Wanchang Lin
## Not run: data(abr1) cl <- factor(abr1$fact$class) dat <- preproc(abr1$pos , y=cl, method=c("log10"),add=1)[,110:500] ## divide data as training and test data idx <- sample(1:nrow(dat), round((2/3)*nrow(dat)), replace=FALSE) ## construct train and test data train.dat <- dat[idx,] train.t <- cl[idx] test.dat <- dat[-idx,] test.t <- cl[-idx] ## tune the best number of components ncomp.plsc <- tune.plsc(dat,cl, pls="simpls",ncomp=20) ncomp.plslda <- tune.plslda(dat,cl, pls="simpls",ncomp=20) ncomp.pcalda <- tune.pcalda(dat,cl, ncomp=60) ## model fit (z.plsc <- plsc(train.dat,train.t, ncomp=ncomp.plsc$ncomp)) (z.plslda <- plslda(train.dat,train.t, ncomp=ncomp.plslda$ncomp)) (z.pcalda <- pcalda(train.dat,train.t, ncomp=ncomp.pcalda$ncomp)) ## or indirect use tune function in model fit z.plsc <- plsc(train.dat,train.t, ncomp=20, tune=TRUE) z.plslda <- plslda(train.dat,train.t, ncomp=20, tune=TRUE) z.pcalda <- pcalda(train.dat,train.t, ncomp=60, tune=TRUE) ## predict test data pred.plsc <- predict(z.plsc, test.dat)$class pred.plslda <- predict(z.plslda, test.dat)$class pred.pcalda <- predict(z.pcalda, test.dat)$class ## classification rate and confusion matrix cl.rate(test.t, pred.plsc) cl.rate(test.t, pred.plslda) cl.rate(test.t, pred.pcalda) ## End(Not run)
Generate the control parameters for resampling process.
valipars(sampling="cv", niter=10, nreps=10, strat=FALSE,div = 2/3)
sampling |
Sampling scheme. Valid options are "loocv" (leave-one-out cross-validation), "cv" (cross-validation), "rand" (randomised validation) and "boot" (bootstrap). |
niter |
Number of iteration or repeat for validation. |
nreps |
Number of replications in each iteration. |
strat |
A logical value indicating whether stratification should be applied to the sampling. |
div |
Proportion of data used for training in the randomised validation method. |
valipars provides a list of control parameters for the resampling or validation used in the process of accuracy evaluation or feature selection.
An object of class valipars containing all the above parameters (either the defaults or the user-specified values).
Wanchang Lin
## generate control parameters for the re-sampling scheme with 5-fold ## cross-validation and iteration of 10 times valipars(sampling = "cv", niter = 10, nreps = 5) ## generate control parameters for the re-sampling scheme with ## 25-replication bootstrap and iteration of 100 times valipars(sampling = "boot", niter = 100, nreps = 25,strat=TRUE) ## generate control parameters for the re-sampling scheme with ## leave-one-out cross-validation valipars(sampling = "loocv")