[2020-04-26]
We have a total of 8 populations with 4 pairs tested on one of 4 herbicides with different modes of action:
print(data.frame(POPULATION=c("ACC13", "ACC49", "ACC31", "ACC62", "ACC11", "ACC59", "ACC09", "ACC59"), HERBICIDE_TREATMENT=rep(c("Clethodim", "Glyphosate", "Sulfometuron", "Terbuthylazine"), each=2)))
In each population, the individuals were sorted according to increasing quantitative herbicide resistance levels, then divided into 5 pools. Giving us in total, 40 data-points, with 10 data-points per herbicide resistance and 5 data-points per population.
With this serevely magnified n << p problem in our PoolGPAS datasets, can we accurately predict phenotypes?
Using the GWAlpha.jl to build our genomic prediction models. We then ask:
Load violinplotter for quick distribution visualization and mean comparisons
library(violinplotter)
Load the concatenated *.rmse csv file
dat = read.csv("GP_PERFORMANCES.csv", header=FALSE)
str(dat)
'data.frame': 296 obs. of 5 variables:
$ V1: Factor w/ 8 levels "ACC09","ACC11",..: 1 1 1 1 1 1 1 1 1 1 ...
$ V2: Factor w/ 5 levels "GLMNET_ALPHA0.0",..: 1 1 1 1 1 1 1 1 2 2 ...
$ V3: Factor w/ 8 levels "ACC09","ACC11",..: 1 2 3 4 5 6 7 8 1 2 ...
$ V4: num 1 -0.0115 0.6342 0.715 0.4882 ...
$ V5: num 3.65e-15 1.66e-02 2.72e-01 4.27e-01 9.14e-01 ...
Add herbicide columns
colnames(dat) = c("TRAINING_POP", "TRAINING_MODEL", "VALIDATION_POP", "CORRELATION", "RMSE")
herbi_df_train = data.frame(TRAINING_POP=c("ACC13", "ACC49", "ACC31", "ACC62", "ACC11", "ACC59", "ACC09", "ACC54"), TRAINING_HERBICIDE=rep(c("CLETHODIM", "GLYPHOSATE", "SULFOMETURON", "TERBUTHYLAZINE"), each=2))
dat = merge(dat, herbi_df_train, by="TRAINING_POP")
herbi_df_test = data.frame(VALIDATION_POP=c("ACC13", "ACC49", "ACC31", "ACC62", "ACC11", "ACC59", "ACC09", "ACC54"), VALIDATION_HERBICIDE=rep(c("CLETHODIM", "GLYPHOSATE", "SULFOMETURON", "TERBUTHYLAZINE"), each=2))
dat = merge(dat, herbi_df_test, by="VALIDATION_POP")
Preliminary plots
Plotting the ditributions of the 2 genomic prediction metrics:
par(mfrow=c(1,2))
hist(dat$CORRELATION, main="", xlab="Correlation(Observed, Predicted)", col="#80b1d3", bord="#377eb8")
hist(dat$RMSE, main="", xlab="Root Mean Square Error", col="#fb8072", bord="#e41a1c")
The metrics vary within the bounds we expect for correlation and the root mean square error, between -1 to +1 and between 0 to 1, respectively.
Sanity checking comparing sensical vs non-sensical genomic predictions, i.e. within herbicide treatment and across herbicide treatment predictions:
dat$SENSE = rep("NON-SENSICAL", times=nrow(dat))
dat$SENSE[dat$TRAINING_HERBICIDE==dat$VALIDATION_HERBICIDE] = "SENSICAL"
par(mfrow=c(1,2))
violinplotter(CORRELATION ~ SENSE, data=dat)
======================================================
Violin Plotting: SENSE
======================================================
[[1]]
[[1]]$HSD_out.LEVELS
[1] NON_SENSICAL SENSICAL
Levels: NON_SENSICAL SENSICAL
[[1]]$HSD_out.GROUPING
[1] b a
Levels: a b
[[1]]$HSD_out.NUMBERS
[1] 1 2
[[1]]$HSD_out.MEANS
[1] 0.01552787 0.43118514
violinplotter(RMSE ~ SENSE, data=dat)
======================================================
Violin Plotting: SENSE
======================================================
[[1]]
[[1]]$HSD_out.LEVELS
[1] NON_SENSICAL SENSICAL
Levels: NON_SENSICAL SENSICAL
[[1]]$HSD_out.GROUPING
[1] a b
Levels: a b
[[1]]$HSD_out.NUMBERS
[1] 1 2
[[1]]$HSD_out.MEANS
[1] 0.4203112 0.2330011
Genomic prediction within herbicide treatments (“SENSICAL”) performed better than prediction across herbicide treatments (“NON-SENSICAL”), which is what we expected.
We then remove these non-sensical genomic predictions:
dat = droplevels(dat[dat$SENSE=="SENSICAL", ])
Comparing auto-cross-validation (model training and validation on the same population) and the proper validation (validation on another population):
dat$AUTOCROSS = rep("AUTO-CROSS", times=nrow(dat))
dat$AUTOCROSS[as.character(dat$TRAINING_POP) != as.character(dat$VALIDATION_POP)] = "PROPER-CROSS"
par(mfrow=c(1,2))
violinplotter(CORRELATION ~ AUTOCROSS, dat)
======================================================
Violin Plotting: AUTOCROSS
======================================================
[[1]]
[[1]]$HSD_out.LEVELS
[1] AUTO_CROSS PROPER_CROSS
Levels: AUTO_CROSS PROPER_CROSS
[[1]]$HSD_out.GROUPING
[1] a b
Levels: a b
[[1]]$HSD_out.NUMBERS
[1] 1 2
[[1]]$HSD_out.MEANS
[1] 0.88630107 -0.01054504
violinplotter(RMSE ~ AUTOCROSS, dat)
======================================================
Violin Plotting: AUTOCROSS
======================================================
[[1]]
[[1]]$HSD_out.LEVELS
[1] AUTO_CROSS PROPER_CROSS
Levels: AUTO_CROSS PROPER_CROSS
[[1]]$HSD_out.GROUPING
[1] b a
Levels: a b
[[1]]$HSD_out.NUMBERS
[1] 1 2
[[1]]$HSD_out.MEANS
[1] 0.06333041 0.40267170
As expected, auto-cross-validation performed better than cross-validation on a different population.
We then remove these auto-cross-validations:
dat = droplevels(dat[dat$AUTOCROSS=="PROPER-CROSS", ])
Now on the the main questions: 1. Which genomic prediction model using Pool-seq data (Pool-GP model) currently implemented in GWAlpha.jl performs best? 2. How does genomic prediction accuracies vary across the herbicide treatments?
aggregate(CORRELATION ~ TRAINING_HERBICIDE, FUN=mean, data=dat)
aggregate(CORRELATION ~ TRAINING_MODEL, FUN=mean, data=dat)
aggregate(RMSE ~ TRAINING_HERBICIDE, FUN=mean, data=dat)
aggregate(RMSE ~ TRAINING_MODEL, FUN=mean, data=dat)
In terms of the correlation between observed and predicted herbicide resistance:
violinplotter(CORRELATION ~ TRAINING_MODEL + TRAINING_HERBICIDE, dat)
======================================================
Violin Plotting: TRAINING_MODEL
======================================================
[[1]]
[[1]]$HSD_out.LEVELS
[1] GLMNET_ALPHA0.0 GLMNET_ALPHA0.5 GLMNET_ALPHA1.0 GWAlpha MIXEDREML_FST
Levels: GLMNET_ALPHA0.0 GLMNET_ALPHA0.5 GLMNET_ALPHA1.0 GWAlpha MIXEDREML_FST
[[1]]$HSD_out.GROUPING
[1] a a a a a
Levels: a
[[1]]$HSD_out.NUMBERS
[1] 1 2 3 4 5
[[1]]$HSD_out.MEANS
[1] -0.042642980 -0.131238527 0.101426215 0.037449175 -0.009915455
[[2]]
[[2]]$HSD_out.LEVELS
[1] CLETHODIM GLYPHOSATE SULFOMETURON TERBUTHYLAZINE
Levels: CLETHODIM GLYPHOSATE SULFOMETURON TERBUTHYLAZINE
[[2]]$HSD_out.GROUPING
[1] b a a a
Levels: a b
[[2]]$HSD_out.NUMBERS
[1] 1 2 3 4
[[2]]$HSD_out.MEANS
[1] -0.5798616 0.2018083 0.1711151 0.2055418
No significant difference was observed between models and herbicide resistances.
In terms of the correlation between observed and predicted herbicide resistance:
violinplotter(RMSE ~ TRAINING_MODEL + TRAINING_HERBICIDE, dat)
======================================================
Violin Plotting: TRAINING_MODEL
======================================================
[[1]]
[[1]]$HSD_out.LEVELS
[1] GLMNET_ALPHA0.0 GLMNET_ALPHA0.5 GLMNET_ALPHA1.0 GWAlpha MIXEDREML_FST
Levels: GLMNET_ALPHA0.0 GLMNET_ALPHA0.5 GLMNET_ALPHA1.0 GWAlpha MIXEDREML_FST
[[1]]$HSD_out.GROUPING
[1] a a a a a
Levels: a
[[1]]$HSD_out.NUMBERS
[1] 1 2 3 4 5
[[1]]$HSD_out.MEANS
[1] 0.4303946 0.4315579 0.4076578 0.3749691 0.3764784
[[2]]
[[2]]$HSD_out.LEVELS
[1] CLETHODIM GLYPHOSATE SULFOMETURON TERBUTHYLAZINE
Levels: CLETHODIM GLYPHOSATE SULFOMETURON TERBUTHYLAZINE
[[2]]$HSD_out.GROUPING
[1] a c d b
Levels: a b c d
[[2]]$HSD_out.NUMBERS
[1] 1 2 3 4
[[2]]$HSD_out.MEANS
[1] 0.76459161 0.28854151 0.01984712 0.42285919
Similarly, no significant difference was observed between models; however, differences in RMSE were observed among the herbicide treatments, with Sulfometuron resistance showing the lowest RMSE. This may be biased due to the low phenotypic variation observed for the sulfometuron resistance:
pheno = read.csv("pheno.csv")
pheno$HERBI_POP = paste0(pheno$HERBI, "-", pheno$POP)
str(pheno)
'data.frame': 3519 obs. of 8 variables:
$ LIBRARY : Factor w/ 8 levels "ACC09","ACC11",..: 3 3 3 3 3 3 3 3 3 3 ...
$ HERBI : Factor w/ 4 levels "CLETH","GLYPH",..: 1 1 1 1 1 1 1 1 1 1 ...
$ POP : Factor w/ 8 levels "ACC09","ACC11",..: 3 3 3 3 3 3 3 3 3 3 ...
$ PLANT : Factor w/ 3519 levels "ACC09_001","ACC09_002",..: 1428 1258 1006 1435 1190 1407 1429 1228 1423 1180 ...
$ RESISTANCE: num 0.245 0.277 0.298 0.3 0.306 ...
$ SURVI : int 0 0 0 0 0 0 0 0 1 0 ...
$ POOL : int 1 1 1 1 1 1 1 1 1 1 ...
$ HERBI_POP : chr "CLETH-ACC13" "CLETH-ACC13" "CLETH-ACC13" "CLETH-ACC13" ...
par(mfrow=c(2,4))
colors = rep(c("#b3de69", "#fdb462", "#80b1d3", "#fb8072"), each=2)
for (i in 1:length(unique(pheno$HERBI_POP))){
herbi = unique(pheno$HERBI_POP)[i]
sub = pheno[pheno$HERBI_POP==herbi, ]
perc_survi = aggregate(SURVI*100 ~ POOL, data=sub, FUN=mean)
hist(sub$RESISTANCE, xlab="", main=herbi, col=colors[i], bord=NA)
legend("topright", legend=c("Resistances:", paste0("Pool", perc_survi[,1], " = ", round(perc_survi[,2]), "%")), bty="n")
}
Consistent with other genomic prediction (genomic selection studies), training and validation populations with similar phenotypic range results in higher prediction accuracies (in our case Sulfometuron resistance: ACC11 and ACC59, this and the narrow range of phenotype variation observed), compared with populations with diverging phenotypic range (in our case Clethodim resistance: ACC13 and ACC49). This dataset (PoolGPAS_08SEAu) together with the Inverleigh-Urana (PoolGPAS_InvUra) and the 60 SE Autralian populations (PoolGPAS_60SEAu) datasets will be merged for a 10-fold cross-validation analyses.