[2020-04-26]

We have a total of 8 populations with 4 pairs tested on one of 4 herbicides with different modes of action:

print(data.frame(POPULATION=c("ACC13", "ACC49", "ACC31", "ACC62", "ACC11", "ACC59", "ACC09", "ACC59"), HERBICIDE_TREATMENT=rep(c("Clethodim", "Glyphosate", "Sulfometuron", "Terbuthylazine"), each=2)))

In each population, the individuals were sorted according to increasing quantitative herbicide resistance levels, then divided into 5 pools. Giving us in total, 40 data-points, with 10 data-points per herbicide resistance and 5 data-points per population.

With this serevely magnified n << p problem in our PoolGPAS datasets, can we accurately predict phenotypes?

Using the GWAlpha.jl to build our genomic prediction models. We then ask:

  1. Which genomic prediction model performs best?
  2. How does genomic prediction accuracies vary across herbicide treatments?

Load violinplotter for quick distribution visualization and mean comparisons

library(violinplotter)

Load the concatenated *.rmse csv file

dat = read.csv("GP_PERFORMANCES.csv", header=FALSE)
str(dat)
'data.frame':   296 obs. of  5 variables:
 $ V1: Factor w/ 8 levels "ACC09","ACC11",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ V2: Factor w/ 5 levels "GLMNET_ALPHA0.0",..: 1 1 1 1 1 1 1 1 2 2 ...
 $ V3: Factor w/ 8 levels "ACC09","ACC11",..: 1 2 3 4 5 6 7 8 1 2 ...
 $ V4: num  1 -0.0115 0.6342 0.715 0.4882 ...
 $ V5: num  3.65e-15 1.66e-02 2.72e-01 4.27e-01 9.14e-01 ...

Add herbicide columns

colnames(dat) = c("TRAINING_POP", "TRAINING_MODEL", "VALIDATION_POP", "CORRELATION", "RMSE")
herbi_df_train = data.frame(TRAINING_POP=c("ACC13", "ACC49", "ACC31", "ACC62", "ACC11", "ACC59", "ACC09", "ACC54"), TRAINING_HERBICIDE=rep(c("CLETHODIM", "GLYPHOSATE", "SULFOMETURON", "TERBUTHYLAZINE"), each=2))
dat = merge(dat, herbi_df_train, by="TRAINING_POP")
herbi_df_test = data.frame(VALIDATION_POP=c("ACC13", "ACC49", "ACC31", "ACC62", "ACC11", "ACC59", "ACC09", "ACC54"), VALIDATION_HERBICIDE=rep(c("CLETHODIM", "GLYPHOSATE", "SULFOMETURON", "TERBUTHYLAZINE"), each=2))
dat = merge(dat, herbi_df_test, by="VALIDATION_POP")

Preliminary plots

Plotting the ditributions of the 2 genomic prediction metrics:

par(mfrow=c(1,2))
hist(dat$CORRELATION, main="", xlab="Correlation(Observed, Predicted)", col="#80b1d3", bord="#377eb8")
hist(dat$RMSE, main="", xlab="Root Mean Square Error", col="#fb8072", bord="#e41a1c")

The metrics vary within the bounds we expect for correlation and the root mean square error, between -1 to +1 and between 0 to 1, respectively.

Sanity checking comparing sensical vs non-sensical genomic predictions, i.e. within herbicide treatment and across herbicide treatment predictions:

dat$SENSE = rep("NON-SENSICAL", times=nrow(dat))
dat$SENSE[dat$TRAINING_HERBICIDE==dat$VALIDATION_HERBICIDE] = "SENSICAL"
par(mfrow=c(1,2))
violinplotter(CORRELATION ~ SENSE, data=dat)
======================================================
Violin Plotting: SENSE
======================================================
[[1]]
[[1]]$HSD_out.LEVELS
[1] NON_SENSICAL SENSICAL    
Levels: NON_SENSICAL SENSICAL

[[1]]$HSD_out.GROUPING
[1] b a
Levels: a b

[[1]]$HSD_out.NUMBERS
[1] 1 2

[[1]]$HSD_out.MEANS
[1] 0.01552787 0.43118514
violinplotter(RMSE ~ SENSE, data=dat)
======================================================
Violin Plotting: SENSE
======================================================
[[1]]
[[1]]$HSD_out.LEVELS
[1] NON_SENSICAL SENSICAL    
Levels: NON_SENSICAL SENSICAL

[[1]]$HSD_out.GROUPING
[1] a b
Levels: a b

[[1]]$HSD_out.NUMBERS
[1] 1 2

[[1]]$HSD_out.MEANS
[1] 0.4203112 0.2330011

Genomic prediction within herbicide treatments (“SENSICAL”) performed better than prediction across herbicide treatments (“NON-SENSICAL”), which is what we expected.

We then remove these non-sensical genomic predictions:

dat = droplevels(dat[dat$SENSE=="SENSICAL", ])

Comparing auto-cross-validation (model training and validation on the same population) and the proper validation (validation on another population):

dat$AUTOCROSS = rep("AUTO-CROSS", times=nrow(dat))
dat$AUTOCROSS[as.character(dat$TRAINING_POP) != as.character(dat$VALIDATION_POP)] = "PROPER-CROSS"
par(mfrow=c(1,2))
violinplotter(CORRELATION ~ AUTOCROSS, dat)
======================================================
Violin Plotting: AUTOCROSS
======================================================
[[1]]
[[1]]$HSD_out.LEVELS
[1] AUTO_CROSS   PROPER_CROSS
Levels: AUTO_CROSS PROPER_CROSS

[[1]]$HSD_out.GROUPING
[1] a b
Levels: a b

[[1]]$HSD_out.NUMBERS
[1] 1 2

[[1]]$HSD_out.MEANS
[1]  0.88630107 -0.01054504
violinplotter(RMSE ~ AUTOCROSS, dat)
======================================================
Violin Plotting: AUTOCROSS
======================================================
[[1]]
[[1]]$HSD_out.LEVELS
[1] AUTO_CROSS   PROPER_CROSS
Levels: AUTO_CROSS PROPER_CROSS

[[1]]$HSD_out.GROUPING
[1] b a
Levels: a b

[[1]]$HSD_out.NUMBERS
[1] 1 2

[[1]]$HSD_out.MEANS
[1] 0.06333041 0.40267170

As expected, auto-cross-validation performed better than cross-validation on a different population.

We then remove these auto-cross-validations:

dat = droplevels(dat[dat$AUTOCROSS=="PROPER-CROSS", ])

Now on the the main questions: 1. Which genomic prediction model using Pool-seq data (Pool-GP model) currently implemented in GWAlpha.jl performs best? 2. How does genomic prediction accuracies vary across the herbicide treatments?

aggregate(CORRELATION ~ TRAINING_HERBICIDE, FUN=mean, data=dat)
aggregate(CORRELATION ~ TRAINING_MODEL, FUN=mean, data=dat)
aggregate(RMSE ~ TRAINING_HERBICIDE, FUN=mean, data=dat)
aggregate(RMSE ~ TRAINING_MODEL, FUN=mean, data=dat)

In terms of the correlation between observed and predicted herbicide resistance:

violinplotter(CORRELATION ~ TRAINING_MODEL + TRAINING_HERBICIDE, dat)
======================================================
Violin Plotting: TRAINING_MODEL
======================================================
[[1]]
[[1]]$HSD_out.LEVELS
[1] GLMNET_ALPHA0.0 GLMNET_ALPHA0.5 GLMNET_ALPHA1.0 GWAlpha         MIXEDREML_FST  
Levels: GLMNET_ALPHA0.0 GLMNET_ALPHA0.5 GLMNET_ALPHA1.0 GWAlpha MIXEDREML_FST

[[1]]$HSD_out.GROUPING
[1] a a a a a
Levels: a

[[1]]$HSD_out.NUMBERS
[1] 1 2 3 4 5

[[1]]$HSD_out.MEANS
[1] -0.042642980 -0.131238527  0.101426215  0.037449175 -0.009915455


[[2]]
[[2]]$HSD_out.LEVELS
[1] CLETHODIM      GLYPHOSATE     SULFOMETURON   TERBUTHYLAZINE
Levels: CLETHODIM GLYPHOSATE SULFOMETURON TERBUTHYLAZINE

[[2]]$HSD_out.GROUPING
[1] b a a a
Levels: a b

[[2]]$HSD_out.NUMBERS
[1] 1 2 3 4

[[2]]$HSD_out.MEANS
[1] -0.5798616  0.2018083  0.1711151  0.2055418

No significant difference was observed between models and herbicide resistances.

In terms of the correlation between observed and predicted herbicide resistance:

violinplotter(RMSE ~ TRAINING_MODEL + TRAINING_HERBICIDE, dat)
======================================================
Violin Plotting: TRAINING_MODEL
======================================================
[[1]]
[[1]]$HSD_out.LEVELS
[1] GLMNET_ALPHA0.0 GLMNET_ALPHA0.5 GLMNET_ALPHA1.0 GWAlpha         MIXEDREML_FST  
Levels: GLMNET_ALPHA0.0 GLMNET_ALPHA0.5 GLMNET_ALPHA1.0 GWAlpha MIXEDREML_FST

[[1]]$HSD_out.GROUPING
[1] a a a a a
Levels: a

[[1]]$HSD_out.NUMBERS
[1] 1 2 3 4 5

[[1]]$HSD_out.MEANS
[1] 0.4303946 0.4315579 0.4076578 0.3749691 0.3764784


[[2]]
[[2]]$HSD_out.LEVELS
[1] CLETHODIM      GLYPHOSATE     SULFOMETURON   TERBUTHYLAZINE
Levels: CLETHODIM GLYPHOSATE SULFOMETURON TERBUTHYLAZINE

[[2]]$HSD_out.GROUPING
[1] a c d b
Levels: a b c d

[[2]]$HSD_out.NUMBERS
[1] 1 2 3 4

[[2]]$HSD_out.MEANS
[1] 0.76459161 0.28854151 0.01984712 0.42285919