starfysh package

Submodules

starfysh.AA module

class starfysh.AA.ArchetypalAnalysis(adata_orig, u=None, u_3d=None, verbose=True, outdir=None, filename=None, savefig=False)[source]

Bases: object

assign_archetypes(anchor_df, threshold=0.2)[source]

Assign the best 1-1 mapping of each archetype community to its closest anchor community (cell-type specific anchor spots), requiring a spot overlap ratio >= threshold.

Parameters:

anchor_df (pd.DataFrame) – Dataframe of anchor spot indices
threshold (float) – Threshold to determine anchor-archetype mapping

Returns:

map_df (pd.DataFrame) – DataFrame of overlapping spot ratio of each anchor i to archetype j
map_dict (dict) – Dictionary of cell type -> mapped archetype
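
The anchor-archetype mapping described above can be illustrated with plain Python sets. This is a minimal sketch, not the Starfysh implementation: it assumes the overlap ratio is |anchor ∩ archetype| / |anchor| and that the 1-1 assignment is made greedily from the highest ratio down. The helper names `overlap_ratio` and `greedy_map` and the toy spot indices are hypothetical.

```python
def overlap_ratio(anchor_spots, arche_spots):
    """Fraction of anchor spots that also belong to the archetype community."""
    anchor_spots, arche_spots = set(anchor_spots), set(arche_spots)
    if not anchor_spots:
        return 0.0
    return len(anchor_spots & arche_spots) / len(anchor_spots)


def greedy_map(anchors, archetypes, threshold=0.2):
    """Greedy 1-1 mapping: highest overlap first, each side used at most once."""
    # Ratio table: map_ratio[cell_type][archetype] -> overlap ratio
    map_ratio = {
        ct: {a: overlap_ratio(spots, arche) for a, arche in archetypes.items()}
        for ct, spots in anchors.items()
    }
    pairs = sorted(
        ((r, ct, a) for ct, row in map_ratio.items() for a, r in row.items()),
        key=lambda t: t[0],
        reverse=True,
    )
    map_dict, used = {}, set()
    for ratio, ct, a in pairs:
        # Accept a pair only if it clears the threshold and both sides are free
        if ratio >= threshold and ct not in map_dict and a not in used:
            map_dict[ct] = a
            used.add(a)
    return map_ratio, map_dict


# Toy data (assumed): spot indices per cell type and per archetype community
anchors = {"T cell": [0, 1, 2, 3], "B cell": [4, 5, 6, 7]}
archetypes = {"arch_0": [0, 1, 2, 9], "arch_1": [6, 10, 11, 12], "arch_2": [20, 21]}
map_ratio, map_dict = greedy_map(anchors, archetypes, threshold=0.2)
print(map_dict)
```

Archetypes that stay unmapped (here `arch_2`) are the candidates that find_distant_archetypes would then rank.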

compute_archetypes(cn=30, n_iters=20, converge=0.001, r=20, display=False)[source]

Estimate the upper bound of the archetype count (k) by calculating the intrinsic dimension. Compute hierarchical archetypes (major + raw) with the given granularity.

Parameters:

cn (int) – Condition number used to choose PCs for the intrinsic dimension estimator, which gives the lower bound for estimating the number of archetypes. Please refer to: https://scikit-dimension.readthedocs.io/en/latest/skdim.id.FisherS.html#skdim.id.FisherS
n_iters (int) – Max. # iterations of AA to find the best k estimation
converge (float) – Convergence criterion for AA iterations, measured by diff(explained variance)
r (int) – Resolution parameter to control granularity of major archetypes. If two archetypes reside within r nearest neighbors, the latter one will be merged.
display (bool) – Whether to display Intrinsic Dimension (ID) estimation plots

Returns:

archetype (np.ndarray) – Raw archetypes as linear combinations of subsets of spot counts (dim: [K, G])
arche_dict (dict) – Hierarchical structure of major_archetype -> its fine-grained neighbor archetypes
major_idx (int) – Index of major archetypes among k raw candidates after merging
evs (list) – Explained variance with different Ks
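
The roles of n_iters and converge can be sketched as a simple stopping rule. This is an illustration under stated assumptions, not the library's search: it assumes k grows by one per iteration from a lower bound `k_min`, and `explained_variance` is a hypothetical callable standing in for an archetypal-analysis fit.

```python
def search_k(explained_variance, k_min, n_iters=20, converge=1e-3):
    """Increase k until the change in explained variance drops below `converge`."""
    evs = []
    k = k_min
    for i in range(n_iters):
        k = k_min + i
        evs.append(explained_variance(k))
        # Stop once adding one more archetype barely changes the explained variance
        if len(evs) > 1 and abs(evs[-1] - evs[-2]) < converge:
            break
    return k, evs


def toy_ev(k):
    """Toy stand-in for an AA fit: explained variance saturates as k grows."""
    return 1.0 - 0.5 ** k


k_best, evs = search_k(toy_ev, k_min=2)
print(k_best, len(evs))
```

The returned `evs` list plays the same role as the evs output documented above: one explained-variance value per candidate k.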

find_archetypal_spots(n_neighbors=20, major=True)[source]

Assign N-nearest-neighbor spots to each archetype as archetypal spots (archetype community)

Parameters:

n_neighbors (int, default=20) – N nearest neighbors of each archetype for archetypal spots
major (bool) – Whether to find NNs for only major archetypes

Returns:

arche_df – Dataframe of archetypal spots

Return type:

pd.DataFrame
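
As a rough illustration of how an archetype community is formed, the sketch below picks the n nearest spots to each archetype. It assumes Euclidean distance in a low-dimensional embedding; the space and metric actually used are not stated here, and `archetypal_spots` and the coordinates are hypothetical.

```python
import math


def archetypal_spots(spots, archetypes, n_neighbors=20):
    """Return, for each archetype, the indices of its n nearest spots."""
    communities = {}
    for label, arche in archetypes.items():
        # Sort spot indices by distance to this archetype, closest first
        order = sorted(range(len(spots)), key=lambda i: math.dist(spots[i], arche))
        communities[label] = order[:n_neighbors]
    return communities


# Toy 2-D embedding (assumed): five spots, two archetypes
spots = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 4.9), (2.5, 2.5)]
archetypes = {"arch_0": (0.0, 0.1), "arch_1": (5.0, 5.1)}
arche_spots = archetypal_spots(spots, archetypes, n_neighbors=2)
print(arche_spots)
```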

find_distant_archetypes(anchor_df, map_dict=None, n=3)[source]

Sort and return the top n archetypes that are unmapped and farthest from anchor spots of known cell types. They are more likely to represent novel cell types / states.

Parameters:

anchor_df (pd.DataFrame) – Dataframe of anchor spot indices
map_dict (dict) – Dictionary of cell type -> mapped archetype
n (int) – Number of distant archetypes to return

Returns:

distant_archetypes – List of archetype labels (farthest –> closest to anchors)

Return type:

list
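
One way to picture the ranking is to score each unmapped archetype by its distance to the nearest anchor spot and sort in descending order. The distance definition (minimum Euclidean distance to any anchor spot) is an assumption made for this sketch, and `distant_archetypes` and the coordinates are hypothetical.

```python
import math


def distant_archetypes(archetypes, anchor_spots, mapped=(), n=3):
    """Rank unmapped archetypes by distance to their nearest anchor spot."""
    scores = {
        label: min(math.dist(coord, s) for s in anchor_spots)
        for label, coord in archetypes.items()
        if label not in mapped  # archetypes already mapped to a cell type are skipped
    }
    # Farthest first, keep the top n
    return sorted(scores, key=scores.get, reverse=True)[:n]


archetypes = {
    "arch_0": (0.0, 0.0),
    "arch_1": (3.0, 0.0),
    "arch_2": (10.0, 0.0),
    "arch_3": (6.0, 0.0),
}
anchor_spots = [(0.0, 0.0), (1.0, 0.0)]
far = distant_archetypes(archetypes, anchor_spots, mapped={"arch_0"}, n=2)
print(far)
```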

find_markers(n_markers=30, display=False)[source]

Find marker genes for each archetype community via Wilcoxon rank sum test (in-group vs. out-of-group)

Parameters:

n_markers (int) – Number of top marker genes to find for each archetype community

Returns:

marker_df – Dataframe of marker genes for each archetype community

Return type:

pd.DataFrame
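
The in-group vs. out-of-group comparison can be sketched with a hand-written Wilcoxon rank-sum z-score. This is a simplified stand-in, not the routine the package calls: it uses the normal approximation without tie correction, and `average_ranks`, `rank_sum_z`, `top_markers` and the gene names are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_z(in_group, out_group):
    """Rank-sum z-score (normal approximation, no tie correction)."""
    n1, n2 = len(in_group), len(out_group)
    ranks = average_ranks(list(in_group) + list(out_group))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u - mean_u) / sd_u


def top_markers(expr, in_spots, n_markers=30):
    """Rank genes by z-score of community spots vs. all remaining spots."""
    scores = {}
    for gene, values in expr.items():
        inside = [v for i, v in enumerate(values) if i in in_spots]
        outside = [v for i, v in enumerate(values) if i not in in_spots]
        scores[gene] = rank_sum_z(inside, outside)
    return sorted(scores, key=scores.get, reverse=True)[:n_markers]


# Toy expression table (assumed): gene -> value per spot, spots 0-2 form the community
expr = {
    "GENE_A": [9, 8, 7, 1, 2, 3],
    "GENE_B": [1, 2, 3, 9, 8, 7],
    "GENE_C": [5, 5, 5, 5, 5, 5],
}
markers = top_markers(expr, in_spots={0, 1, 2}, n_markers=2)
print(markers)
```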

plot_anchor_archetype_clusters(anchor_df, cell_types=None, arche_lbls=None, lgd_ncol=2, do_3d=False)[source]: Jointly display a subset of anchor spots & archetypal spots (to visualize their degree of overlap)

plot_archetypes(major=True, do_3d=False, lgd_ncol=1, figsize=(6, 4), disp_cluster=True, disp_arche=True)[source]: Display archetype & archetypal spot communities

plot_mapping(map_df, figsize=(6, 5))[source]: Display anchor - archetype mapping (overlapping # spot ratio)

starfysh.dataloader module

class starfysh.dataloader.VisiumDataset(adata, args)[source]

Bases: Dataset

Loading preprocessed Visium AnnData, gene signature & Anchor spots for Starfysh training

class starfysh.dataloader.VisiumPoEDataSet(adata, args)[source]

Bases: VisiumDataset

Return the data stack with expression and image

starfysh.plot_utils module

starfysh.plot_utils.pl_spatial_inf_feature(adata, feature, factor=None, vmin=0, vmax=None, spot_size=100, alpha=0, cmap='Spectral_r')[source]: Spatial visualization of Starfysh inference features

starfysh.plot_utils.pl_spatial_inf_gene(adata, factor, feature, vmin=0, vmax=None, spot_size=100, alpha=0, cmap='Spectral_r')[source]

starfysh.plot_utils.pl_umap_feature(qz_u, qc, cmap, title, spot_size=3, vmin=0, vmax=None)[source]: Single Z-UMAP visualization of Starfysh deconvolutions

starfysh.plot_utils.plot_anchor_spots(umap_plot, pure_spots, sig_mean, bbox_x=2)[source]

starfysh.plot_utils.plot_evs(evs, kmin)[source]

starfysh.plot_utils.plot_spatial_feature(adata_sample, map_info, variable, label)[source]

starfysh.plot_utils.plot_spatial_gene(adata_sample, map_info, gene_name)[source]

starfysh.post_analysis module

starfysh.post_analysis.create_corr_network_5(G, node_size_list, corr_direction, min_correlation)[source]

starfysh.post_analysis.display_reconst(df_true, df_pred, density=False, marker_genes=None, sample_rate=0.1, size=(3, 3), spot_size=1, title=None, x_label='', y_label='', x_min=0, x_max=10, y_min=0, y_max=10)[source]: Scatter plot - raw gexp vs. reconstructed gexp

starfysh.post_analysis.gene_mean_vs_inferred_prop(inference_outputs, visium_args, idx)[source]

starfysh.post_analysis.get_LISA(W, X)[source]

starfysh.post_analysis.get_Moran(W, X)[source]

starfysh.post_analysis.get_SCI(W, X, Y)[source]

starfysh.post_analysis.get_adata(sample_ids, data_folder)[source]

starfysh.post_analysis.get_cormtx(sample_id, hub_num)[source]

starfysh.post_analysis.get_corr_map(inference_outputs, proportions)[source]

starfysh.post_analysis.get_factor_dist(sample_ids, file_path)[source]

starfysh.post_analysis.get_hub_cormtx(sample_ids, hub_num)[source]

starfysh.post_analysis.get_z_umap(qz_m)[source]

starfysh.post_analysis.plot_density(results, category_names)[source]

Parameters:

results (dict) – A mapping from question labels to a list of answers per category. It is assumed all lists contain the same number of entries and that it matches the length of category_names.
category_names (list of str) – The category labels.

starfysh.post_analysis.plot_stacked_prop(results, category_names)[source]

Parameters:

results (dict) – A mapping from question labels to a list of answers per category. It is assumed all lists contain the same number of entries and that it matches the length of category_names.
category_names (list of str) – The category labels.

starfysh.post_analysis.plot_type_all(inference_outputs, u, proportions)[source]

starfysh.starfysh module

class starfysh.starfysh.AVAE(adata, gene_sig, win_loglib)[source]

Bases: Module

Model design:

p(x|z)=f(z)
p(z|x)~N(0,1)
q(z|x)~g(x)

generative(inference_outputs, xs_k)[source]

get_loss(generative_outputs, inference_outputs, x, x_peri, library, device)[source]

inference(x)[source]

reparameterize(mu, log_var)[source]

Parameters:

mu – mean from the encoder’s latent space
log_var – log variance from the encoder’s latent space
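
The reparameterization step draws z = mu + sigma * eps with eps ~ N(0, 1) and sigma = exp(0.5 * log_var), so that sampling stays differentiable with respect to mu and log_var. The plain-Python sketch below illustrates the arithmetic only; the model itself operates on tensors, and the list-based function and its `rng` argument are assumptions made for illustration.

```python
import math
import random


def reparameterize(mu, log_var, rng=random):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), element-wise."""
    z = []
    for m, lv in zip(mu, log_var):
        sigma = math.exp(0.5 * lv)  # log-variance -> standard deviation
        eps = rng.gauss(0.0, 1.0)   # noise is independent of mu and log_var
        z.append(m + sigma * eps)
    return z


rng = random.Random(0)
# With a very small variance the sample collapses onto the mean
z = reparameterize([0.0, 2.0], [-40.0, -40.0], rng=rng)
print(z)
```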

training: bool

class starfysh.starfysh.AVAE_PoE(adata, gene_sig, patch_r, win_loglib)[source]

Bases: Module

Model design:

p(x|z)=f(z)
p(z|x)~N(0,1)
q(z|x)~g(x)

generative(inference_outputs, xs_k, img_path_outputs)[source]

Parameters:

xs_k (torch.Tensor) – Z-normed avg. gene exprs

get_loss(generative_outputs, inference_outputs, img_path_outputs, poe_path_outputs, x, x_peri, library, adata_img, device)[source]

inference(x)[source]

predict_imgVAE(x)[source]

predictor_POE(inference_outputs, exp_path_outputs, img_path_outputs)[source]

reparameterize(mu, log_var)[source]

Parameters:

mu – mean from the encoder’s latent space
log_var – log variance from the encoder’s latent space

training: bool

class starfysh.starfysh.NegBinom(mu, theta, eps=1e-10)[source]

Bases: Distribution

Gamma-Poisson mixture approximation of Negative Binomial(mean, dispersion)

lambda ~ Gamma(mu, theta)
x ~ Poisson(lambda)
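
The two-stage draw above can be sketched with the standard library. The exact Gamma parameterization is not spelled out here, so this sketch assumes shape = theta and scale = mu / theta, which gives mean mu and variance mu + mu^2 / theta. `sample_poisson` and `sample_neg_binom` are hypothetical helpers, not part of the package.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small to moderate rates."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def sample_neg_binom(mu, theta, rng):
    """Gamma-Poisson mixture: lambda ~ Gamma, then x ~ Poisson(lambda)."""
    lam = rng.gammavariate(theta, mu / theta)  # assumed: shape=theta, scale=mu/theta
    return sample_poisson(lam, rng)


rng = random.Random(0)
draws = [sample_neg_binom(5.0, 2.0, rng) for _ in range(20000)]
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
print(round(mean, 2), round(var, 2))
```

The sample variance exceeds the sample mean, which is the overdispersion that distinguishes this mixture from a plain Poisson.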

arg_constraints = {'mu': GreaterThanEq(lower_bound=0), 'theta': GreaterThanEq(lower_bound=0)}

log_prob(x)[source]: log-likelihood
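
For reference, a scalar version of a negative binomial log-likelihood can be written with `math.lgamma`. This is a sketch under assumptions: theta is treated as the inverse dispersion (variance mu + mu^2 / theta), and eps is used only to keep the logarithms finite. The class itself works on tensors, and `neg_binom_log_prob` is a hypothetical name.

```python
import math


def neg_binom_log_prob(x, mu, theta, eps=1e-10):
    """log NB(x | mean=mu, inverse dispersion=theta) for a single count x."""
    log_theta_mu = math.log(theta + mu + eps)
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * (math.log(theta + eps) - log_theta_mu)
        + x * (math.log(mu + eps) - log_theta_mu)
    )


# Sanity check: the probabilities over the support sum to (approximately) one
total = sum(math.exp(neg_binom_log_prob(x, mu=5.0, theta=2.0)) for x in range(500))
print(total)
```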

sample()[source]: Generates a sample_shape shaped sample or sample_shape shaped batch of samples if the distribution parameters are batched.

support = IntegerGreaterThan(lower_bound=0)

starfysh.starfysh.model_ct_exp(model, adata, visium_args, poe=False, device=device(type='cpu'))[source]: Obtain predicted cell-type specific expression in each spot

starfysh.starfysh.model_eval(model, adata, visium_args, poe=False, device=device(type='cpu'))[source]

starfysh.starfysh.train(model, dataloader, device, optimizer)[source]

starfysh.starfysh.train_poe(model, dataloader, device, optimizer)[source]

starfysh.starfysh.valid_model(model)[source]

starfysh.utils module

class starfysh.utils.VisiumArguments(adata, adata_norm, gene_sig, img_metadata, **kwargs)[source]

Bases: object

Loading Visium AnnData, perform preprocessing, library-size smoothing & Anchor spot detection

Parameters:

adata (AnnData) – annotated visium count matrix
adata_norm (AnnData) – annotated visium count matrix after normalization & log-transform
gene_sig (pd.DataFrame) – list of signature genes for each cell type. (dim: [S, Cell_type])
img_metadata (dict) – Spatial information metadata (histology image, coordinates, scalefactor)

append_factors(arche_markers)[source]: Append list of archetypes (w/ corresponding markers) as additional cell type(s) / state(s) to the gene_sig

get_adata()[source]: Return adata after preprocessing & HVG gene selection

get_anchors()[source]: Return indices of anchor spots for each cell type

get_img_patches()[source]

replace_factors(factors_to_repl, arche_markers)[source]: Replace factor(s) with archetypes & their corresponding markers in the gene_sig

starfysh.utils.append_sigs(gene_sig, factor, sigs, n_genes=5)[source]: Append list of genes to a given cell type as additional signatures or add novel cell type / states & their signatures

starfysh.utils.extract_feature(adata, key)[source]: Extract generative / inference output from adata.obsm and generate a dummy temporary adata for plotting

starfysh.utils.filter_gene_sig(gene_sig, adata_df)[source]

starfysh.utils.get_adata_wsig(adata, adata_norm, gene_sig)[source]: Select intersection of HVGs from dataset & signature annotations

starfysh.utils.get_alpha_min(sig_mean, pure_dict)[source]: Calculate alpha_min for Dirichlet dist. for each factor

starfysh.utils.get_anchor_spots(adata_sample, sig_mean, v_low=20, v_high=95, n_anchor=40)[source]

Calculate the top anchor spots enriched for each given cell type (determined by normalized expression values from each signature)

Parameters:

adata_sample (sc.AnnData) – ST raw count
v_low (int) – the low threshold to filter high-quality spots
v_high (int) – the high threshold to filter high-quality spots
n_anchor (int) – # anchor spots per cell type

Returns:

pure_spots (np.ndarray) – anchor spot indices per cell type (dim: [S, n_anchor])
pure_dict (dict) – Cell-type -> Anchor spots
adata_pure (np.ndarray) – Binary indicators of anchor spots (dim: [S, n_anchor])
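
The core of anchor selection, picking the spots with the highest signature score per cell type, can be sketched as follows. The v_low / v_high quality filter is left out because the quantity it thresholds is not described here. `top_anchor_spots` and the scores are hypothetical, and plain lists stand in for the array inputs.

```python
def top_anchor_spots(sig_mean, n_anchor=40):
    """For each cell type, pick the n_anchor spots with the highest signature score."""
    anchors = {}
    for cell_type, scores in sig_mean.items():
        # Spot indices sorted by score, highest first
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        anchors[cell_type] = ranked[:n_anchor]
    return anchors


# Toy signature scores (assumed): one normalized score per spot and cell type
sig_mean = {
    "T cell": [0.1, 0.9, 0.5, 0.8],
    "B cell": [0.7, 0.2, 0.6, 0.1],
}
anchors = top_anchor_spots(sig_mean, n_anchor=2)
print(anchors)
```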

starfysh.utils.get_simu_map_info(umap_plot)[source]

starfysh.utils.get_umap(adata_sample, display=False)[source]

starfysh.utils.get_windowed_library(adata_sample, map_info, library, window_size)[source]

starfysh.utils.init_weights(module)[source]

starfysh.utils.load_adata(data_folder, sample_id, n_genes, multiple_data=False)[source]

Load Visium adata with raw counts, preprocess & extract highly variable genes

Parameters:

data_folder (str) – Root directory of the data
sample_id (str) – Sample subdirectory under data_folder
n_genes (int) – Number of genes used for training
multiple_data (bool) – Whether the study includes multiple datasets

Returns:

adata (sc.AnnData) – Processed ST raw counts
adata_norm (sc.AnnData) – Processed ST normalized & log-transformed data

starfysh.utils.load_signatures(filename, adata)[source]

Load annotated signature gene sets

Parameters:

filename (str) – Signature file
adata (sc.AnnData) – ST count matrix

Returns:

gene_sig – signatures per cell type / state

Return type:

pd.DataFrame

starfysh.utils.preprocess(adata_raw, lognorm=True, min_perc=None, max_perc=None, n_top_genes=6000, mt_thld=100, verbose=True, multiple_data=False)[source]

Preprocess the ST gexp matrix and remove ribosomal & mitochondrial genes

Parameters:

adata_raw (AnnData) – Spot x Gene raw expression matrix [S x G]
min_perc (float) – Lower-bound percentile of non-zero gexps for filtering spots
max_perc (float) – Upper-bound percentile of non-zero gexps for filtering spots
n_top_genes (int) – Number of variable genes to select
mt_thld (float) – Max. percentage of mitochondrial gexps; spots with excessive MT expression are filtered out
multiple_data (bool) – Whether the study needs to integrate multiple datasets
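
The gene and spot filtering can be pictured with a small dictionary-based sketch. It rests on assumptions not stated here: mitochondrial genes are recognized by the `MT-` prefix and ribosomal genes by `RPS` / `RPL` (common human gene naming), and the MT percentage is computed on the raw counts before genes are removed. `filter_counts` is a hypothetical helper working on plain dicts rather than AnnData.

```python
MT_PREFIX = "MT-"                 # assumed naming convention
RIBO_PREFIXES = ("RPS", "RPL")    # assumed naming convention


def filter_counts(counts, mt_thld=100.0):
    """counts: {spot_id: {gene: raw_count}} -> filtered copy."""
    kept = {}
    for spot, genes in counts.items():
        total = sum(genes.values())
        mt = sum(c for g, c in genes.items() if g.startswith(MT_PREFIX))
        mt_pct = 100.0 * mt / total if total else 0.0
        if mt_pct > mt_thld:
            continue  # drop spots dominated by mitochondrial reads
        # Remove mitochondrial and ribosomal genes from the retained spot
        kept[spot] = {
            g: c
            for g, c in genes.items()
            if not g.startswith(MT_PREFIX) and not g.startswith(RIBO_PREFIXES)
        }
    return kept


counts = {
    "spot_1": {"CD3E": 8, "MT-CO1": 2, "RPS6": 10},
    "spot_2": {"CD3E": 1, "MT-CO1": 9, "RPL13": 0},
}
filtered = filter_counts(counts, mt_thld=50.0)
print(filtered)
```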

starfysh.utils.preprocess_img(data_path, sample_id, adata_index, hchannel=False)[source]

Load and preprocess visium paired H&E image & spatial coords

Parameters:

data_path (str) – Root directory of the data
sample_id (str) – Sample subdirectory under data_path
hchannel (bool) – Whether to apply binary color deconvolution to extract the hematoxylin channel. Please refer to: https://digitalslidearchive.github.io/HistomicsTK/examples/color_deconvolution.html

Returns:

adata_image (np.ndarray) – Processed histology image
map_info (np.ndarray) – Spatial coords of spots (dim: [S, 2])

starfysh.utils.refine_anchors(visium_args, aa_model, thld=0.35, n_genes=5, n_iters=1)[source]

Refine anchor spots & marker genes with archetypal analysis. We append DEGs computed from archetypes to their best-matched anchors, then re-compute the anchor spots.

Parameters:

visium_args (VisiumArguments) – Default parameter set for Starfysh upon dataloading
aa_model (ArchetypalAnalysis) – Pre-computed archetype object
thld (float) – Threshold cutoff for anchor-archetype mapping
n_genes (int) – # archetypal marker genes to append per refinement iteration

Returns:

visium_args – Updated parameter set for Starfysh

Return type:

VisiumArguments

starfysh.utils.run_starfysh(visium_args, n_repeats=3, lr=0.001, epochs=100, patience=10, poe=False, device=device(type='cpu'), verbose=True)[source]

Wrapper to run starfysh deconvolution.

Note: adds an early-stopping mechanism evaluated by loss c:

- early-stopping with patience=10
- choose the best among 3 reruns

Parameters:

visium_args (VisiumArguments) – Preprocessed metadata calculated from input visium matrix: e.g. mean signature expression, library size, anchor spots, etc.
n_repeats (int) – Number of restarts to run Starfysh
epochs (int) – Max. number of iterations
patience (int) – Max. counts for early-stopping if q(c) doesn’t drop
poe (bool) – Whether to perform inference with PoE w/ image integration

Returns:

best_model (starfysh.AVAE or starfysh.AVAE_PoE) – Trained Starfysh model with deconvolution results
loss (np.ndarray) – Training losses
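
The combination of early stopping and restarts described above follows a common pattern, sketched here with a generic loss. The monitored quantity, the training step and the way the best run is chosen are assumptions for illustration. `train_with_patience`, `best_of_repeats` and `make_step` are hypothetical and do not call Starfysh.

```python
def train_with_patience(step, epochs=100, patience=10):
    """Run `step(epoch) -> loss` until the loss stops improving."""
    best, wait, history = float("inf"), 0, []
    for epoch in range(epochs):
        loss = step(epoch)
        history.append(loss)
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best, history


def best_of_repeats(make_step, n_repeats=3, **kwargs):
    """Restart training n_repeats times and report the run with the lowest loss."""
    runs = [train_with_patience(make_step(r), **kwargs) for r in range(n_repeats)]
    best_run = min(range(n_repeats), key=lambda r: runs[r][0])
    return best_run, runs


def make_step(run_id):
    """Toy training step: the loss decreases by 1 per epoch down to a per-run floor."""
    floor = [5, 2, 8][run_id]
    return lambda epoch: max(floor, 10 - epoch)


best_run, runs = best_of_repeats(make_step, n_repeats=3, epochs=100, patience=10)
print(best_run, runs[best_run][0])
```

Each run stops well before the 100-epoch cap because the toy loss plateaus, and the restart with the lowest plateau is selected.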
