starfysh package

Submodules

starfysh.AA module

class starfysh.AA.ArchetypalAnalysis(adata_orig, u=None, u_3d=None, verbose=True, outdir=None, filename=None, savefig=False)[source]

Bases: object

assign_archetypes(anchor_df, threshold=0.2)[source]

Assign the best 1-1 mapping from each archetype community to its closest anchor community (cell-type-specific anchor spots), keeping only pairs whose spot overlap ratio is >= threshold.

Parameters:
  • anchor_df (pd.DataFrame) – Dataframe of anchor spot indices

  • threshold (float) – Threshold to determine anchor-archetype mapping

Returns:

  • map_df (pd.DataFrame) – DataFrame of overlapping spot ratio of each anchor i to archetype j

  • map_dict (dict) – Dictionary of cell type -> mapped archetype
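
The overlap-based mapping described above can be sketched as follows. This is a minimal illustration, not Starfysh's implementation: the function name `map_by_overlap`, the plain-set inputs, the ratio definition (shared spots divided by anchor community size) and the greedy tie-breaking are all assumptions.

```python
def map_by_overlap(anchors, archetypes, threshold=0.2):
    """Greedy 1-1 mapping of anchor communities to archetype communities.

    anchors / archetypes: dict of label -> set of spot indices (hypothetical inputs).
    Overlap ratio (assumed definition): |anchor & archetype| / |anchor|.
    """
    # Overlap ratio of every anchor i to every archetype j
    ratios = {
        (ct, arche): len(a_spots & r_spots) / len(a_spots)
        for ct, a_spots in anchors.items()
        for arche, r_spots in archetypes.items()
    }
    mapping, used = {}, set()
    # Visit pairs from highest to lowest overlap; each label is used at most once
    for (ct, arche), ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
        if ratio >= threshold and ct not in mapping and arche not in used:
            mapping[ct] = arche
            used.add(arche)
    return ratios, mapping


anchors = {"T cell": {0, 1, 2, 3}, "B cell": {4, 5, 6, 7}}
archetypes = {"arch_0": {0, 1, 2, 9}, "arch_1": {7, 10, 11, 12}, "arch_2": {20, 21}}
ratios, mapping = map_by_overlap(anchors, archetypes, threshold=0.2)
```

In this toy input `arch_2` shares no spots with any anchor and therefore stays unmapped, which is the kind of archetype that find_distant_archetypes is documented to surface.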

compute_archetypes(cn=30, n_iters=20, converge=0.001, r=20, display=False)[source]

Estimate the upper bound of the archetype count (k) by calculating the intrinsic dimension, then compute hierarchical archetypes (major + raw) at the given granularity.

Parameters:
  • cn (int) – Condition number used to choose PCs for the intrinsic dimension estimator, as a lower bound for the archetype count estimation. Please refer to: https://scikit-dimension.readthedocs.io/en/latest/skdim.id.FisherS.html#skdim.id.FisherS

  • n_iters (int) – Max. number of AA iterations to find the best k estimation

  • converge (float) – Convergence criterion for the AA iteration, measured by diff(explained variance)

  • r (int) – Resolution parameter controlling the granularity of major archetypes. If two archetypes reside within r nearest neighbors of each other, the latter one is merged.

  • display (bool) – Whether to display Intrinsic Dimension (ID) estimation plots

Returns:

  • archetype (np.ndarray, dim=[K, G]) – Raw archetypes as linear combinations of a subset of spot counts

  • arche_dict (dict) – Hierarchical structure of major_archetype -> its fine-grained neighbor archetypes

  • major_idx (int) – Index of major archetypes among the k raw candidates after merging

  • evs (list) – Explained variance for different values of k
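
One way to read the converge criterion is as a stopping rule on successive explained-variance values. The sketch below shows that reading only. The function name `first_converged_k`, the layout of `evs` and the fallback to the largest k are assumptions, and the actual iteration inside compute_archetypes may differ.

```python
def first_converged_k(evs, k_min, converge=1e-3):
    """Return the first k whose gain in explained variance falls below `converge`.

    evs[i] is the explained variance obtained with k = k_min + i archetypes
    (hypothetical layout). Falls back to the largest k tried.
    """
    for i in range(1, len(evs)):
        if abs(evs[i] - evs[i - 1]) < converge:
            return k_min + i
    return k_min + len(evs) - 1


evs = [0.80, 0.86, 0.89, 0.8905, 0.8906]
k = first_converged_k(evs, k_min=5, converge=1e-3)
```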

find_archetypal_spots(n_neighbors=20, major=True)[source]

Assign N-nearest-neighbor spots to each archetype as archetypal spots (archetype community)

Parameters:
  • n_neighbors (int (default=20)) – Number of nearest neighbors of each archetype to use as archetypal spots

  • major (bool) – Whether to find NNs for only major archetypes

Returns:

arche_df – Dataframe of archetypal spots

Return type:

pd.DataFrame
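
The nearest-neighbor assignment can be illustrated with a small pure-Python sketch. The function name `nearest_spots`, the tuple-based inputs and the Euclidean metric are assumptions made for illustration. The library may use a different embedding or neighbor search.

```python
import math


def nearest_spots(archetypes, spots, n_neighbors=20):
    """For each archetype, return the indices of its n_neighbors closest spots.

    archetypes / spots: lists of coordinate tuples in a shared embedding
    (hypothetical inputs; Euclidean distance is an assumption).
    """
    communities = {}
    for a_idx, arche in enumerate(archetypes):
        # Rank all spot indices by distance to this archetype
        ranked = sorted(range(len(spots)), key=lambda s: math.dist(arche, spots[s]))
        communities[a_idx] = ranked[:n_neighbors]
    return communities


spots = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.2)]
archetypes = [(0.0, 0.0), (5.0, 5.0)]
communities = nearest_spots(archetypes, spots, n_neighbors=2)
```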

find_distant_archetypes(anchor_df, map_dict=None, n=3)[source]

Sort and return the top n archetypes that are unmapped and farthest from the anchor spots of known cell types. These are more likely to represent novel cell types / states.

Parameters:
  • anchor_df (pd.DataFrame) – Dataframe of anchor spot indices

  • map_dict (dict) – Dictionary of cell type -> mapped archetype

  • n (int) – Number of distant archetypes to return

Returns:

distant_archetypes – List of archetype labels (farthest –> closest to anchors)

Return type:

list

find_markers(n_markers=30, display=False)[source]

Find marker genes for each archetype community via Wilcoxon rank sum test (in-group vs. out-of-group)

Parameters:

n_markers (int) – Number of top marker genes to find for each archetype community

Returns:

marker_df – Dataframe of marker genes for each archetype community

Return type:

pd.DataFrame
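
For readers unfamiliar with the test named above, the following is a minimal sketch of a Wilcoxon rank-sum z-score for one gene (in-group vs. out-of-group). The function name `rank_sum_z` is hypothetical, the normal approximation omits the tie correction, and this is not necessarily how find_markers computes its statistics. A tested implementation is available as `scipy.stats.ranksums`.

```python
import math


def rank_sum_z(in_group, out_group):
    """Wilcoxon rank-sum z-score (normal approximation, no tie correction)."""
    pooled = sorted(in_group + out_group)
    # Average 1-based rank for each distinct value, so ties share a rank
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j
    n1, n2 = len(in_group), len(out_group)
    w = sum(rank_of[v] for v in in_group)   # rank sum of the in-group
    mean = n1 * (n1 + n2 + 1) / 2           # expected rank sum under H0
    var = n1 * n2 * (n1 + n2 + 1) / 12      # variance of the rank sum under H0
    return (w - mean) / math.sqrt(var)


# Toy expression values of one gene inside vs. outside an archetype community
z = rank_sum_z([5.0, 6.0, 7.0], [1.0, 2.0, 3.0, 4.0])
```

A large positive z indicates that the gene is ranked higher inside the community than outside it.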

plot_anchor_archetype_clusters(anchor_df, cell_types=None, arche_lbls=None, lgd_ncol=2, do_3d=False)[source]

Jointly display a subset of anchor spots & archetypal spots (to visualize their degree of overlap)

plot_archetypes(major=True, do_3d=False, lgd_ncol=1, figsize=(6, 4), disp_cluster=True, disp_arche=True)[source]

Display archetype & archetypal spot communities

plot_mapping(map_df, figsize=(6, 5))[source]

Display anchor - archetype mapping (overlapping # spot ratio)

starfysh.dataloader module

class starfysh.dataloader.VisiumDataset(adata, args)[source]

Bases: Dataset

Loading preprocessed Visium AnnData, gene signature & Anchor spots for Starfysh training

class starfysh.dataloader.VisiumPoEDataSet(adata, args)[source]

Bases: VisiumDataset

Return the data stack with expression and image

starfysh.plot_utils module

starfysh.plot_utils.pl_spatial_inf_feature(adata, feature, factor=None, vmin=0, vmax=None, spot_size=100, alpha=0, cmap='Spectral_r')[source]

Spatial visualization of Starfysh inference features

starfysh.plot_utils.pl_spatial_inf_gene(adata, factor, feature, vmin=0, vmax=None, spot_size=100, alpha=0, cmap='Spectral_r')[source]
starfysh.plot_utils.pl_umap_feature(qz_u, qc, cmap, title, spot_size=3, vmin=0, vmax=None)[source]

Single Z-UMAP visualization of Starfysh deconvolutions

starfysh.plot_utils.plot_anchor_spots(umap_plot, pure_spots, sig_mean, bbox_x=2)[source]
starfysh.plot_utils.plot_evs(evs, kmin)[source]
starfysh.plot_utils.plot_spatial_feature(adata_sample, map_info, variable, label)[source]
starfysh.plot_utils.plot_spatial_gene(adata_sample, map_info, gene_name)[source]

starfysh.post_analysis module

starfysh.post_analysis.create_corr_network_5(G, node_size_list, corr_direction, min_correlation)[source]
starfysh.post_analysis.display_reconst(df_true, df_pred, density=False, marker_genes=None, sample_rate=0.1, size=(3, 3), spot_size=1, title=None, x_label='', y_label='', x_min=0, x_max=10, y_min=0, y_max=10)[source]

Scatter plot - raw gexp vs. reconstructed gexp

starfysh.post_analysis.gene_mean_vs_inferred_prop(inference_outputs, visium_args, idx)[source]
starfysh.post_analysis.get_LISA(W, X)[source]
starfysh.post_analysis.get_Moran(W, X)[source]
starfysh.post_analysis.get_SCI(W, X, Y)[source]
starfysh.post_analysis.get_adata(sample_ids, data_folder)[source]
starfysh.post_analysis.get_cormtx(sample_id, hub_num)[source]
starfysh.post_analysis.get_corr_map(inference_outputs, proportions)[source]
starfysh.post_analysis.get_factor_dist(sample_ids, file_path)[source]
starfysh.post_analysis.get_hub_cormtx(sample_ids, hub_num)[source]
starfysh.post_analysis.get_z_umap(qz_m)[source]
starfysh.post_analysis.plot_density(results, category_names)[source]
Parameters:
  • results (dict) – A mapping from question labels to a list of answers per category. It is assumed all lists contain the same number of entries and that it matches the length of category_names.

  • category_names (list of str) – The category labels.

starfysh.post_analysis.plot_stacked_prop(results, category_names)[source]
Parameters:
  • results (dict) – A mapping from question labels to a list of answers per category. It is assumed all lists contain the same number of entries and that it matches the length of category_names.

  • category_names (list of str) – The category labels.

starfysh.post_analysis.plot_type_all(inference_outputs, u, proportions)[source]

starfysh.starfysh module

class starfysh.starfysh.AVAE(adata, gene_sig, win_loglib)[source]

Bases: Module

Model design

p(x|z)=f(z)
p(z|x)~N(0,1)
q(z|x)~g(x)

generative(inference_outputs, xs_k)[source]
get_loss(generative_outputs, inference_outputs, x, x_peri, library, device)[source]
inference(x)[source]
reparameterize(mu, log_var)[source]
Parameters:
  • mu – mean from the encoder’s latent space

  • log_var – log variance from the encoder’s latent space
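
The entry above documents only the two parameters. Assuming the method follows the standard VAE reparameterization trick, it can be sketched in plain Python as below, with lists of floats standing in for tensors. The `rng` argument is a hypothetical addition for reproducibility.

```python
import math
import random


def reparameterize(mu, log_var, rng=random):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), element-wise.

    mu / log_var: lists of floats (stand-ins for the encoder's tensors).
    """
    z = []
    for m, lv in zip(mu, log_var):
        sigma = math.exp(0.5 * lv)   # log-variance -> standard deviation
        eps = rng.gauss(0.0, 1.0)    # noise drawn independently of mu and log_var
        z.append(m + sigma * eps)
    return z


rng = random.Random(0)
# With a very small variance the sample collapses onto the mean
z = reparameterize([1.0, -2.0], [-40.0, -40.0], rng=rng)
```

Keeping the randomness in `eps` leaves `z` a deterministic function of `mu` and `log_var`, which is what allows gradients to pass through the sampling step in a tensor framework.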

training: bool
class starfysh.starfysh.AVAE_PoE(adata, gene_sig, patch_r, win_loglib)[source]

Bases: Module

Model design:

p(x|z)=f(z)
p(z|x)~N(0,1)
q(z|x)~g(x)

generative(inference_outputs, xs_k, img_path_outputs)[source]
Parameters:
  • xs_k (torch.Tensor) – Z-normed avg. gene exprs

get_loss(generative_outputs, inference_outputs, img_path_outputs, poe_path_outputs, x, x_peri, library, adata_img, device)[source]
inference(x)[source]
predict_imgVAE(x)[source]
predictor_POE(inference_outputs, exp_path_outputs, img_path_outputs)[source]
reparameterize(mu, log_var)[source]
Parameters:
  • mu – mean from the encoder’s latent space

  • log_var – log variance from the encoder’s latent space

training: bool
class starfysh.starfysh.NegBinom(mu, theta, eps=1e-10)[source]

Bases: Distribution

Gamma-Poisson mixture approximation of Negative Binomial(mean, dispersion)

lambda ~ Gamma(mu, theta)
x ~ Poisson(lambda)

arg_constraints = {'mu': GreaterThanEq(lower_bound=0), 'theta': GreaterThanEq(lower_bound=0)}
log_prob(x)[source]

log-likelihood
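
As a worked illustration, the log-likelihood of a negative binomial can be written with the mean / dispersion parameterization, under which the Gamma-Poisson mixture has mean mu and variance mu + mu^2 / theta. The sketch below is an assumption about the form, not a copy of NegBinom.log_prob. In particular, the function name `nb_log_prob` and the placement of `eps` inside the logarithms are hypothetical.

```python
import math


def nb_log_prob(x, mu, theta, eps=1e-10):
    """Log-pmf of a negative binomial with mean `mu` and dispersion `theta`."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu) + eps)
        + x * math.log(mu / (theta + mu) + eps)
    )


mu, theta = 3.0, 2.0
# Evaluate the pmf on a range wide enough that the truncated tail is negligible
probs = [math.exp(nb_log_prob(x, mu, theta)) for x in range(400)]
total = sum(probs)
mean = sum(x * p for x, p in enumerate(probs))
var = sum(x * x * p for x, p in enumerate(probs)) - mean ** 2
```

The numerical checks on `total`, `mean` and `var` confirm that the formula defines a proper distribution with the stated moments.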

sample()[source]

Generates a sample_shape shaped sample or sample_shape shaped batch of samples if the distribution parameters are batched.

support = IntegerGreaterThan(lower_bound=0)
starfysh.starfysh.model_ct_exp(model, adata, visium_args, poe=False, device=device(type='cpu'))[source]

Obtain predicted cell-type specific expression in each spot

starfysh.starfysh.model_eval(model, adata, visium_args, poe=False, device=device(type='cpu'))[source]
starfysh.starfysh.train(model, dataloader, device, optimizer)[source]
starfysh.starfysh.train_poe(model, dataloader, device, optimizer)[source]
starfysh.starfysh.valid_model(model)[source]

starfysh.utils module

class starfysh.utils.VisiumArguments(adata, adata_norm, gene_sig, img_metadata, **kwargs)[source]

Bases: object

Load Visium AnnData, then perform preprocessing, library-size smoothing & anchor spot detection

Parameters:
  • adata (AnnData) – annotated visium count matrix

  • adata_norm (AnnData) – annotated visium count matrix after normalization & log-transform

  • gene_sig (pd.DataFrame) – list of signature genes for each cell type. (dim: [S, Cell_type])

  • img_metadata (dict) – Spatial information metadata (histology image, coordinates, scalefactor)

append_factors(arche_markers)[source]

Append list of archetypes (w/ corresponding markers) as additional cell type(s) / state(s) to the gene_sig

get_adata()[source]

Return adata after preprocessing & HVG gene selection

get_anchors()[source]

Return indices of anchor spots for each cell type

get_img_patches()[source]
replace_factors(factors_to_repl, arche_markers)[source]

Replace factor(s) with archetypes & their corresponding markers in the gene_sig

starfysh.utils.append_sigs(gene_sig, factor, sigs, n_genes=5)[source]

Append list of genes to a given cell type as additional signatures or add novel cell type / states & their signatures

starfysh.utils.extract_feature(adata, key)[source]

Extract generative / inference output from adata.obsm and generate a temporary dummy adata for plotting

starfysh.utils.filter_gene_sig(gene_sig, adata_df)[source]
starfysh.utils.get_adata_wsig(adata, adata_norm, gene_sig)[source]

Select intersection of HVGs from dataset & signature annotations

starfysh.utils.get_alpha_min(sig_mean, pure_dict)[source]

Calculate alpha_min for Dirichlet dist. for each factor

starfysh.utils.get_anchor_spots(adata_sample, sig_mean, v_low=20, v_high=95, n_anchor=40)[source]

Calculate the top anchor spot enriched for the given cell type (determined by normalized expression values from each signature)

Parameters:
  • adata_sample (sc.AnnData) – ST raw count

  • v_low (int) – the low threshold to filter high-quality spots

  • v_high (int) – the high threshold to filter high-quality spots

  • n_anchor (int) – # anchor spots per cell type

Returns:

  • pure_spots (np.ndarray) – anchor spot indices per cell type (dim: [S, n_anchor])

  • pure_dict (dict) – Cell-type -> Anchor spots

  • adata_pure (np.ndarray) – Binary indicators of anchor spots (dim: [S, n_anchor])
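
The core selection step, picking the n_anchor highest-scoring spots per cell type, can be sketched as below. The function name `top_anchor_spots` and the dict-of-lists input are assumptions. The v_low / v_high quality filtering is deliberately omitted because its exact definition is not given here.

```python
def top_anchor_spots(sig_scores, n_anchor=40):
    """Pick the n_anchor highest-scoring spots for each cell type.

    sig_scores: dict of cell type -> list of per-spot signature scores
    (hypothetical layout). Returns cell type -> spot indices, best first.
    """
    anchors = {}
    for cell_type, scores in sig_scores.items():
        # Spot indices ordered from highest to lowest signature score
        order = sorted(range(len(scores)), key=lambda s: scores[s], reverse=True)
        anchors[cell_type] = order[:n_anchor]
    return anchors


sig_scores = {
    "T cell": [0.1, 0.9, 0.3, 0.8],
    "B cell": [0.7, 0.2, 0.6, 0.1],
}
anchors = top_anchor_spots(sig_scores, n_anchor=2)
```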

starfysh.utils.get_simu_map_info(umap_plot)[source]
starfysh.utils.get_umap(adata_sample, display=False)[source]
starfysh.utils.get_windowed_library(adata_sample, map_info, library, window_size)[source]
starfysh.utils.init_weights(module)[source]
starfysh.utils.load_adata(data_folder, sample_id, n_genes, multiple_data=False)[source]

Load Visium adata with raw counts, preprocess & extract highly variable genes

Parameters:
  • data_folder (str) – Root directory of the data

  • sample_id (str) – Sample subdirectory under data_folder

  • n_genes (int) – Number of genes used for training

  • multiple_data (bool) – Whether the study includes multiple datasets

Returns:

  • adata (sc.AnnData) – Processed ST raw counts

  • adata_norm (sc.AnnData) – Processed ST normalized & log-transformed data

starfysh.utils.load_signatures(filename, adata)[source]

Load annotated signature gene sets

Parameters:
  • filename (str) – Signature file

  • adata (sc.AnnData) – ST count matrix

Returns:

gene_sig – signatures per cell type / state

Return type:

pd.DataFrame

starfysh.utils.preprocess(adata_raw, lognorm=True, min_perc=None, max_perc=None, n_top_genes=6000, mt_thld=100, verbose=True, multiple_data=False)[source]

Preprocess the ST gene expression matrix and remove ribosomal & mitochondrial genes

Parameters:
  • adata_raw (AnnData) – Spot x Gene raw expression matrix [S x G]

  • min_perc (float) – lower-bound percentile of non-zero gexps for filtering spots

  • max_perc (float) – upper-bound percentile of non-zero gexps for filtering spots

  • n_top_genes (int) – number of highly variable genes to keep

  • mt_thld (float) – max. percentage of mitochondrial gexps for filtering spots with excessive MT expressions

  • multiple_data (bool) – whether the study needs to integrate multiple datasets
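
The two filters described above (dropping ribosomal & mitochondrial genes, and thresholding spots by mitochondrial percentage) can be sketched in plain Python. The helper names and the `MT-` / `RPS` / `RPL` prefixes are assumptions based on the usual human gene-symbol convention. The prefixes used by preprocess are not stated here.

```python
def drop_mito_ribo(gene_names, mito_prefix="MT-", ribo_prefixes=("RPS", "RPL")):
    """Return gene names that are neither mitochondrial nor ribosomal.

    Prefixes follow the usual human gene-symbol convention (an assumption here).
    """
    keep = []
    for gene in gene_names:
        upper = gene.upper()
        if upper.startswith(mito_prefix) or upper.startswith(ribo_prefixes):
            continue
        keep.append(gene)
    return keep


def mito_percent(counts, gene_names, mito_prefix="MT-"):
    """Percentage of one spot's counts that come from mitochondrial genes."""
    total = sum(counts)
    mito = sum(c for c, g in zip(counts, gene_names) if g.upper().startswith(mito_prefix))
    return 100.0 * mito / total if total else 0.0


genes = ["CD3E", "MT-CO1", "RPS6", "RPL13", "MS4A1"]
kept = drop_mito_ribo(genes)
pct = mito_percent([10, 30, 5, 5, 50], genes)
```

A spot would then be discarded when its `mito_percent` exceeds mt_thld. With the default mt_thld=100, no spot can exceed the threshold.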

starfysh.utils.preprocess_img(data_path, sample_id, adata_index, hchannel=False)[source]

Load and preprocess visium paired H&E image & spatial coords

Returns:

  • adata_image (np.ndarray) – Processed histology image

  • map_info (np.ndarray) – Spatial coords of spots (dim: [S, 2])

starfysh.utils.refine_anchors(visium_args, aa_model, thld=0.35, n_genes=5, n_iters=1)[source]

Refine anchor spots & marker genes with archetypal analysis. We append DEGs computed from archetypes to their best-matched anchors followed by re-computing new anchor spots

Parameters:
  • visium_args (VisiumArguments) – Default parameter set for Starfysh upon dataloading

  • aa_model (ArchetypalAnalysis) – Pre-computed archetype object

  • thld (float) – Threshold cutoff for anchor-archetype mapping

  • n_genes (int) – # archetypal marker genes to append per refinement iteration

Returns:

visium_args – updated parameter set for Starfysh

Return type:

VisiumArguments

starfysh.utils.run_starfysh(visium_args, n_repeats=3, lr=0.001, epochs=100, patience=10, poe=False, device=device(type='cpu'), verbose=True)[source]

Wrapper to run starfysh deconvolution.

Note: an early-stopping mechanism, evaluated by loss c, is applied. By default, training stops early with patience=10 and the best model among 3 reruns is chosen.

Parameters:
  • visium_args (VisiumArguments) – Preprocessed metadata calculated from input visium matrix: e.g. mean signature expression, library size, anchor spots, etc.

  • n_repeats (int) – Number of restarts when running Starfysh

  • epochs (int) – Max. number of iterations

  • patience (int) – Max. counts for early-stopping if q(c) doesn’t drop

  • poe (bool) – Whether to perform inference with PoE (with image integration)

Returns:

  • best_model (starfysh.AVAE or starfysh.AVAE_PoE) – Trained Starfysh model with deconvolution results

  • loss (np.ndarray) – Training losses
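
The interplay of patience and n_repeats can be sketched without any model code. In the sketch below, precomputed loss sequences stand in for real training runs. The function names and the convention that a lower final best loss wins are assumptions, not the wrapper's actual logic.

```python
def train_with_early_stopping(losses, patience=10):
    """Scan per-epoch losses; stop once `patience` epochs pass without improvement.

    Returns (best_loss, epochs_run). `losses` stands in for a real training loop.
    """
    best, wait, epochs_run = float("inf"), 0, 0
    for loss in losses:
        epochs_run += 1
        if loss < best:
            best, wait = loss, 0      # improvement resets the counter
        else:
            wait += 1
            if wait >= patience:
                break
    return best, epochs_run


def best_of_restarts(runs, patience=10):
    """Evaluate every restart and keep the index and loss of the best one."""
    results = [train_with_early_stopping(r, patience)[0] for r in runs]
    best_idx = min(range(len(results)), key=results.__getitem__)
    return best_idx, results[best_idx]


runs = [
    [5.0, 4.0, 3.5, 3.6, 3.7, 3.8],   # stalls at 3.5
    [5.0, 3.0, 2.0, 2.1, 2.2, 1.0],   # would reach 1.0, but stops before that
    [6.0, 5.0, 4.0, 3.0, 2.5, 2.4],   # improves every epoch
]
best_idx, best_loss = best_of_restarts(runs, patience=2)
```

The second run shows the trade-off of a small patience value: training stops at 2.0 and never sees the later drop to 1.0.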

Module contents