Compute PGS from GWAS summary statistics using Bayesian sum of single-effect (SuSiE) linear regression using z scores

'rapidopgs_multi computes PGS from a from GWAS summary statistics using Bayesian sum of single-effect (SuSiE) linear regression using z scores

rapidopgs_multi(
  data,
  reference = NULL,
  LDmatrices = NULL,
  N = NULL,
  build = c("hg19", "hg38"),
  trait = c("cc", "quant"),
  ncores = 1,
  alpha.block = 1e-04,
  alpha.snp = 0.01,
  sd.prior = NULL,
  ancestry = "EUR",
  LDblocks = NULL
)

Arguments

data: a data.table containing GWAS summary statistic dataset with all required information.
reference: a string representing the path to the directory containing the reference panel (eg. "../ref-data").
LDmatrices: a string representing the path to the directory containing the pre-computed LD matrices.
N: a numeric indicating the number of individuals used to generate input GWAS dataset, or a string indicating the column name containing per-SNP sample size.
build: a string indicating the genome build. 'hg19' and 'hg38' are supported. Note that your LD matrices or reference panel should match the build.
trait: a string indicating if trait is a case-control ("cc") or quantitative ("quant").
ncores: a numeric specifying the number of cores (CPUs) to be used. If using pre-computed LD matrices, one core is enough for best performance.
alpha.block: a numeric threshold for minimum P-value in LD blocks. Blocks with minimum P above alpha.block will be skipped. Default: 1e-4.
alpha.snp: a numeric threshold for P-value pruning within LD block. SNPs with P above alpha.snp will be removed. Default: 0.01.
sd.prior: the prior specifies that BETA at causal SNPs follows a centred normal distribution with standard deviation sd.prior. If NULL (default) it will be automatically estimated (recommended).
ancestry: a string indicating the ancestral population (DEFAULT: "EUR", European). If using an alternative population, bear in mind that your LD matrices or reference must be from the same population. You'll also need to provide matching LD.blocks via the LDblocks argument.
LDblocks: a string indicating the path to an alternative LD block file in .RData format. Only required for non-European PGS.

Value

a data.table containing the sumstats dataset with computed PGS weights.

Details

This function will take a GWAS summary statistic dataset as an input, will assign LD blocks to it, then use user-provided LD matrices or a preset reference panel in Plink format to compute LD matrices for each block. Then SuSiE method will be used to compute posterior probabilities of variants to be causal and generate PGS weights by multiplying those posteriors by effect sizes (\(\beta\)). Unlike rapidopgs_single, this approach will assume one or more causal variants.

The GWAS summary statistics file to compute PGS using our method must contain the following minimum columns, with these exact column names:

CHR: Chromosome
BP: Base position (in GRCh37/hg19).
REF: Reference, or non-effect allele
ALT: Alternative, or effect allele, the one \(\beta\) refers to
BETA: \(\beta\) (or log(OR)), or effect sizes
SE: standard error of \(\beta\)
P: P-value for the association test

In addition, quantitative traits must have the following extra column:

ALT_FREQ: Minor allele frequency.

Also, for quantitative traits, sample size must be supplied, either as a number, or indicating the column name, for per-SNP sample size datasets (see below). Other columns are allowed, and will be ignored.

Reference panel should be divided by chromosome, in Plink format. Both reference panel and summary statistic dataset should be in GRCh37/hg19. For 1000 Genomes panel, you can use create_1000G function to set it up automatically.

If prefer to use LD matrices, you must indicate the path to the directory where they are stored. They must be in RDS format, named LD_chrZ.rds (where Z is the 1-22 chromosome number). If you don't have LD matrices already, we recommend downloading those gently provided by Prive et al., at https://figshare.com/articles/dataset/European_LD_reference/13034123. These matrices were computed using for 1,054,330 HapMap3 variants based on 362,320 European individuals of the UK biobank.

Author

Guillermo Reales, Chris Wallace

Examples

if (FALSE) { # \dontrun{
ss <- data.table(
    CHR=c(4,20,14,2,4,6,6,21,13), 
    BP=c(1479959, 13000913, 29107209, 203573414, 57331393, 11003529, 149256398, 
        25630085, 79166661), 
    REF=c("C","C","C","T","G","C","C","G","T"), 
    ALT=c("A","T","T","A","A","A","T","A","C"), 
    BETA=c(0.012,0.0079,0.0224,0.0033,0.0153,0.058,0.0742,0.001,-0.0131),
    SE=c(0.0099,0.0066,0.0203,0.0171,0.0063,0.0255,0.043,0.0188,0.0074),
    P=c(0.2237,0.2316,0.2682,0.8477,0.01473,0.02298,0.08472,0.9573,0.07535))
PGS <- rapidopgs_multi(ss, reference = "ref-data/", N = 20000, build = "hg19", trait="cc", ncores=5)
} # }