Guide to calculating spatial genes
Giotto supports multiple ways for searching for spatially variable genes. Currently we have incorporated SpatialDE, trendceek, Spark, as well as two methods that we have developed binSpect and silhouetteRank. The common goal is to score genes in the spatial transcriptomic dataset based on the extent to which individual genes' expression values form a spatially coherent pattern (or whether there is a dependence of expression on spatial locations). The methods achieve this goal through various algorithms and statistical tests.
Methods
SpatialDE
The method uses Gaussian process regression to decompose expression variability into a spatial covariance term and nonspatial variance term. The spatial covariance term assumes a linear trend and periodic pattern of gene expression variation. Multiple different spatial covariance functions are tested including: (1) null model, (2) general Gaussian covariance (squared exponential), (3) linear covariance, and (4) periodic covariance functions. A suitable model is selected using Bayes information criterion.
Trendsceek
The method employs previous functions within R called spatstat. In basic, it computes four statistics mark-seggregation summary statistics, including the Stoyan's mark-correlation function, mean-mark function, variance-mark function and mark-variogram of marked point process. Mark seggregation computes the probability of finding two marks given the separation of two points. Considering the distribution of all pairs at a particular radius, a mark segregation is said to be present if this distribution is dependent on r, such that it deviates from what would be expected if the marks would be randomly distributed over the spatial locations of the points. For statistical testing, a null distribution of summary statistics is computed for every radius r, after permuting expression labels. The maximum deviation (from expected value) observed among n randomly distributed patterns is compared with the observed deviation (from expected value). A p-value is calculated via the rank.
SPARK
SPARK is a method perceived as an extension of SpatialDE. SPARK models count data directly and employs properly calibrated p-values. Well-calibrated p-values allows this method to find more spatially variable genes at a given FDR cut-off. This is sometimes an improvement over SpatialDE which may produce overly conservative p-values. SPARK models expression levels across spatial locations using generalized linear spatial model (GLSM). It allows modeling the distribution of expression values through an overdispersed Poisson distribution (for count data) or Gaussian distribution (for normalized data). Notably, to make sure that the algorithm can discover various spatial patterns, SPARK employs ten different spatial kernels, including five periodic kernels with different periodicity parameters and five Gaussian kernels with different smoothness parameters.
BinSpect
This method uses a graph based approach to compute the probability of encountering two physically neighboring cells being both expressed. It first computes a KNN neighborhood graph for a given k (defined by user) based on cells' physical locations. For each gene, it binarizes gene expression value across all the cells (1 for expressed, 0 for nonexpressed). An edge in the KNN graph connecting two cells is classified as 1-0, 1-1, 0-1, or 0-0 based on gene expression label. A contigency table tallying all the 1-1's, 1-0's, 0-1's, and 0-0's edges for the entire KNN network is created. A P-value is reported based on the hypergeometric distribution test.
SilhouetteRank
This method stands for silhouette coefficient on binarized spatial gene expression profiles. As the name suggests, gene expression is first binarized to 1's (expressed) and 1's (nonexpressed). It next calculates silhouette coefficient of the 1-marked cells on a locally weighted spatial distance matrix. The distance function is 1 -similarity, where the similarity function is a rank-based, exponentially-transformed score emphasizing more the closely located cells and penalizes far away cells' distance. For statistical testing, N randomly permuted patterns are generated by shuffling the cell locations while keeping the distribution of 1's and 0's the same as real data. The extreme right-tail of distribution of random patterns' silhouette scores is modeled using the GPD distribution.
Usage
SpatialDE
spatialDE function
spatialDE <- function(gobject = NULL, expression_values = c('raw', 'normalized', 'scaled', 'custom'), size = c(4,2,1), color = c("blue", "green", "red"), sig_alpha = 0.5, unsig_alpha = 0.5, python_path = NULL, show_plot = NA, return_plot = NA, save_plot = NA, save_param = list(), default_save_name = 'SpatialDE')
The input is a gene expression matrix. There are 4 version of expression matrix (indicated by expression_values
). Raw version (in counts) is recommended. SpatialDE performs library size normalization (by default) if raw expression is used. Otherwise, one can also use “normalized” and skip SpatialDE normalization step.
There are no other parameters required. The parameters color
, sig_alpha
, unsig_alpha
are used for plotting the Fraction spatial variance vs Adj. P-value https://github.com/Teichlab/SpatialDE, and is optional. To disable this FSV vs. Adj P-value plot, show_plot
is set to NA (default). The parameters return_plot
, save_plot
, save_param
are for saving the results automatically to disk (default values are NA). They are attached to every function (see CreateGiottoInstructions()
).
Outputs:
A data frame with the results. There are 3 fields reported per gene: LLR, pval, qval. LLR is log-likelihood of model, useful for creating a whole ranking of genes unambiguously. P-val, Q-val are useful for cut-off based approach to filtering the spatial genes.
Trendsceek
trendSceek function
trendSceek <- function(gobject, expression_values = c("normalized", "raw"), subset_genes = NULL, nrand = 100, ncores = 8, ...)
The input is a gene expression matrix. The “normalized” version of the matrix is recommended - this is library size normalized in normalized counts. Trendceek requires performing library size normalization externally, hence “normalized” version is required. The subset_genes
is used for scoring only a set of genes in the expression matrix.
The nrand
is the number of random resamplings of the mark distribution as to create the null distribution. The ncores
is the number of cores.
Outputs:
A list containing trendceek-statistics for every gene.
References:
SPARK
spark function
spark = function(gobject, percentage = 0.1, min_count = 10, expression_values = 'raw', num_core = 5, covariates = NULL, return_object = 'data.table', fit_model=c("poisson", "gaussian"), ...)
The input is a gene expression matrix. The SPARK algorithm requires as input a raw unfiltered expression matrix (in counts). The matrix should be unfiltered as SPARK performs its own gene selection, cell selection, and minimal quality controls to remove unwanted genes and cells. The percentage
and min_count
are SPARK's gene and cell filtering options. The covariates
specifies any additional covariate to account for prior to the algorithm. By default, SPARK already considers the total count per cell as covariate.
min_counts
: The minimum counts for each cell for filtering
percentage
: The percentage of cells that are expressed for each gene for filtering (range from 0 to 1 for 0% to 100%)
fit_model
: Whether to use poisson or gaussian model. Poisson is recommended for count data. Gaussian is recommended for normalized data.
Outputs
A data table containing gene name, combined P-value, and adjusted P-value. Combined P-value contains the P-value after applying Cauchy combination rules to combine the results of 10 spatial kernels. Adj P-value contains the adjusted p-value for multiple hypothesis testing (coming from all the genes).
References
SilhouetteRank
silhouetteRank function
silhouetteRank = function(gobject, expression_values = c('normalized', 'scaled', 'custom'), subset_genes = NULL, metric='euclidean', rbp_p = 0.95, examine_top = 0.300)
The input is a gene expression matrix that is normalized by library size (specified by expression_values="normalized"
).
SilhouetteRank uses two key settings: rbp_p
and examine_top
. The rbp_p
is the local neighborhood distance weighting constant (0.95 - 0.995). The smaller the rbp_p
, the more focus to put on local distances. The distances are exponentially decayed according to rbp_p
. The option examine_top
is the proportion of cells that participate in the spatial pattern (0 - 1.0). Small examine_top
(0.01) tests for streak and very focal pattern. Large examine_top
(0.3) finds large spatial pattern.
If example_top
is set to a high proportion such that (example_top * num_cells) > (num_expressed_cells)
of a given gene, then a number of cells equal to the difference is randomly selected among the non-expressed groups to become the expressed population. Then recalculate the silhouette score. This is done so as to make all genes comparable regardless of any difference in terms of the number of detected cells.
Note: the function can be repeated on a spatially permuted expression matrix to generate an empirical P-value for each silhouette score.
Outputs
A data frame containing gene name, the silhouette score.
References
BinSpect
binSpect function
binSpect = function(gobject, bin_method = c('kmeans', 'rank'), expression_values = c('normalized', 'scaled', 'custom'), subset_genes = NULL, reduce_network = FALSE, kmeans_algo = c('kmeans', 'kmeans_arma', 'kmeans_arma_subset'), nstart = 3, iter_max = 10, extreme_nr = 50, spatial_network_name='Delaunay_network', sample_nr = 50, percentage_rank = 30, do_fisher_test = TRUE, adjust_method = 'fdr', calc_hub = FALSE, hub_min_int = 3, get_av_expr = TRUE, get_high_expr = TRUE, implementation = c('data.table', 'simple', 'matrix'), group_size = 'automatic', do_parallel = TRUE, cores = NA, verbose = T, knn_params = NULL, set.seed = NULL, summarize = c('adj.p.value', 'p.value'))
The input is a normalized gene expression matrix (expression_values="normalized"
).
Prior to binSpect
, the spatial network (for example KNN network) based on spatial locations should be created (see function createSpatialNetwork
for details). Afterward, the name of the spatial network should be given to spatial_network_name
in binSpect
. For example: createSpatialNetwork(g_object, method='kNN', name='knn_network', k=20)
, then binSpect(gobject, spatial_network_name='knn_network', ...)
.
The key parameter is the percentage_rank
, which is the percentage of top expressed cells to be binarized to 1's.
The bin_method
is binarization based on K-means or rank. For k-means, a 2-class K-means is performed and the higher class is assigned 1's (nstart
, iter_max
are pertinent parameters of K-means). For rank, the top ranked percentage of cells are assigned 1's (percentage_rank
is the pertinent parameter).
Other optional settings calc_hub
, hub_min_int
, get_av_expr
, and get_high_expr
will also report statistics about individual genes:
calc_hub
: calculate the number of hub cells
hub_min_int
: minimum number of cell-cell interactions for a hub cell
get_av_expr
: calculate the average expression per gene of the high expressing cells
get_high_expr
: calculate the number of high expressing cells per gene
BinSpect runs in parallel mode by default (do_parallel
) and will utilize all cores. Number of cores can be set by cores
.
Outputs
A data table with results.
References