starfysh package
Submodules
starfysh.AA module
- class starfysh.AA.ArchetypalAnalysis(adata_orig, u=None, u_3d=None, verbose=True, outdir=None, filename=None, savefig=False)[source]
Bases: object
- assign_archetypes(anchor_df, threshold=0.2)[source]
Assign the best 1-1 mapping of each archetype community to its closest anchor community (cell-type-specific anchor spots), keeping only pairs whose spot overlap ratio is >= threshold
- Parameters:
anchor_df (pd.DataFrame) – Dataframe of anchor spot indices
threshold (float) – Threshold to determine anchor-archetype mapping
- Returns:
map_df (pd.DataFrame) – DataFrame of overlapping spot ratio of each anchor i to archetype j
map_dict (dict) – Dictionary of cell type -> mapped archetype
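The mapping idea can be illustrated with a small pure-Python sketch. This is not the Starfysh implementation: the helper names, the definition of the overlap ratio (shared spots divided by the number of anchor spots), and the greedy matching strategy are all assumptions made for illustration.

```python
def overlap_ratio(anchor_spots, arche_spots):
    """Fraction of anchor spots that also belong to the archetype community.

    Assumption: the ratio is normalized by the anchor community size.
    """
    anchor_spots, arche_spots = set(anchor_spots), set(arche_spots)
    if not anchor_spots:
        return 0.0
    return len(anchor_spots & arche_spots) / len(anchor_spots)


def map_anchors_to_archetypes(anchors, archetypes, threshold=0.2):
    """Greedy 1-1 mapping: repeatedly accept the highest-overlap unmatched pair."""
    pairs = [
        (overlap_ratio(a_spots, r_spots), cell_type, arche)
        for cell_type, a_spots in anchors.items()
        for arche, r_spots in archetypes.items()
    ]
    # Keep only pairs that pass the threshold, best overlap first
    pairs = [p for p in pairs if p[0] >= threshold]
    pairs.sort(key=lambda p: p[0], reverse=True)

    mapping, used = {}, set()
    for _, cell_type, arche in pairs:
        if cell_type not in mapping and arche not in used:
            mapping[cell_type] = arche
            used.add(arche)
    return mapping


# Hypothetical toy data: spot indices per cell type / archetype
anchors = {"T cell": [0, 1, 2, 3], "B cell": [4, 5, 6, 7]}
archetypes = {"arch_0": [0, 1, 2, 9], "arch_1": [4, 10, 11, 12], "arch_2": [20, 21]}
mapping = map_anchors_to_archetypes(anchors, archetypes, threshold=0.2)
print(mapping)
```

Raising the threshold removes weak matches, so a cell type with little overlap is left unmapped rather than forced onto an archetype.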
- compute_archetypes(cn=30, n_iters=20, converge=0.001, r=20, display=False)[source]
Estimate the upper bound of the archetype count (k) by calculating the intrinsic dimension, then compute hierarchical archetypes (major + raw) with the given granularity
- Parameters:
cn (int) – Condition number used to choose PCs for the intrinsic dimension estimator, as the lower bound of the archetype count estimate. Please refer to: https://scikit-dimension.readthedocs.io/en/latest/skdim.id.FisherS.html#skdim.id.FisherS
n_iters (int) – Max. number of AA iterations to find the best k estimate
converge (float) – Convergence criterion for the AA iterations, measured as diff(explained variance)
r (int) – Resolution parameter controlling the granularity of major archetypes. If two archetypes reside within r nearest neighbors of each other, the latter one is merged.
display (bool) – Whether to display Intrinsic Dimension (ID) estimation plots
- Returns:
archetype (np.ndarray, dim=[K, G]) – Raw archetypes as linear combinations of a subset of spot counts
arche_dict (dict) – Hierarchical structure of major archetype -> its fine-grained neighbor archetypes
major_idx (int) – Index of major archetypes among the k raw candidates after merging
evs (list) – Explained variance for different values of k
- find_archetypal_spots(n_neighbors=20, major=True)[source]
Assign N-nearest-neighbor spots to each archetype as archetypal spots (archetype community)
- Parameters:
n_neighbors (int, default=20) – Number of nearest neighbors of each archetype to assign as archetypal spots
major (bool) – Whether to find NNs for only major archetypes
- Returns:
arche_df – Dataframe of archetypal spots
- Return type:
pd.DataFrame
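As a rough illustration of the nearest-neighbor assignment, the sketch below picks the closest spots to each archetype by Euclidean distance. The function name, the 2-D coordinates, and the choice of distance space are assumptions; the library may compute neighbors in a different embedding.

```python
import math


def nearest_spots(archetype, spots, n_neighbors=2):
    """Return indices of the n_neighbors spots closest to one archetype."""
    order = sorted(range(len(spots)), key=lambda i: math.dist(archetype, spots[i]))
    return order[:n_neighbors]


# Hypothetical 2-D positions for spots and archetypes
spots = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (0.0, 1.5), (6.0, 5.0)]
archetypes = {"arch_0": (0.0, 0.0), "arch_1": (6.0, 6.0)}

# Each archetype community is the list of its nearest spot indices
communities = {name: nearest_spots(pos, spots) for name, pos in archetypes.items()}
print(communities)
```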
- find_distant_archetypes(anchor_df, map_dict=None, n=3)[source]
Sort and return the top n archetypes that are unmapped and farthest from the anchor spots of known cell types. They are more likely to represent novel cell types / states.
- Parameters:
anchor_df (pd.DataFrame) – Dataframe of anchor spot indices
map_dict (dict) – Dictionary of cell type -> mapped archetype
n (int) – Number of distant archetypes to return
- Returns:
distant_archetypes – List of archetype labels, ordered from farthest to closest to the anchors
- Return type:
list
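One way to picture the ranking is sketched below: unmapped archetypes are sorted by their distance to the nearest anchor spot. The helper name, the toy coordinates, and the "distance to nearest anchor" criterion are assumptions for illustration, not the library's documented metric.

```python
import math


def distant_archetypes(arche_pos, anchor_pos, mapped, n=3):
    """Rank unmapped archetypes from farthest to closest to any anchor spot."""

    def dist_to_nearest_anchor(name):
        return min(math.dist(arche_pos[name], a) for a in anchor_pos)

    unmapped = [name for name in arche_pos if name not in mapped]
    unmapped.sort(key=dist_to_nearest_anchor, reverse=True)
    return unmapped[:n]


# Hypothetical positions; "arch_0" is already mapped to a cell type
arche_pos = {"arch_0": (0, 0), "arch_1": (10, 10), "arch_2": (3, 0), "arch_3": (0, 6)}
anchor_pos = [(0, 0), (1, 0)]
top = distant_archetypes(arche_pos, anchor_pos, mapped={"arch_0"}, n=2)
print(top)
```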
- find_markers(n_markers=30, display=False)[source]
Find marker genes for each archetype community via Wilcoxon rank sum test (in-group vs. out-of-group)
- Parameters:
n_markers (int) – Number of top marker genes to find for each archetype community
- Returns:
marker_df – Dataframe of marker genes for each archetype community
- Return type:
pd.DataFrame
- plot_anchor_archetype_clusters(anchor_df, cell_types=None, arche_lbls=None, lgd_ncol=2, do_3d=False)[source]
Jointly display a subset of anchor spots & archetypal spots (to visualize their degree of overlap)
starfysh.dataloader module
- class starfysh.dataloader.VisiumDataset(adata, args)[source]
Bases: Dataset
Loads preprocessed Visium AnnData, gene signatures & anchor spots for Starfysh training
- class starfysh.dataloader.VisiumPoEDataSet(adata, args)[source]
Bases: VisiumDataset
Returns the data stack with expression and image
starfysh.plot_utils module
- starfysh.plot_utils.pl_spatial_inf_feature(adata, feature, factor=None, vmin=0, vmax=None, spot_size=100, alpha=0, cmap='Spectral_r')[source]
Spatial visualization of Starfysh inference features
- starfysh.plot_utils.pl_spatial_inf_gene(adata, factor, feature, vmin=0, vmax=None, spot_size=100, alpha=0, cmap='Spectral_r')[source]
starfysh.post_analysis module
- starfysh.post_analysis.create_corr_network_5(G, node_size_list, corr_direction, min_correlation)[source]
- starfysh.post_analysis.display_reconst(df_true, df_pred, density=False, marker_genes=None, sample_rate=0.1, size=(3, 3), spot_size=1, title=None, x_label='', y_label='', x_min=0, x_max=10, y_min=0, y_max=10)[source]
Scatter plot - raw gexp vs. reconstructed gexp
- starfysh.post_analysis.plot_density(results, category_names)[source]
- Parameters:
results (dict) – A mapping from question labels to a list of answers per category. It is assumed all lists contain the same number of entries and that it matches the length of category_names.
category_names (list of str) – The category labels.
- starfysh.post_analysis.plot_stacked_prop(results, category_names)[source]
- Parameters:
results (dict) – A mapping from question labels to a list of answers per category. It is assumed all lists contain the same number of entries and that it matches the length of category_names.
category_names (list of str) – The category labels.
starfysh.starfysh module
- class starfysh.starfysh.AVAE(adata, gene_sig, win_loglib)[source]
Bases: Module
- Model design:
p(x|z)=f(z)
p(z|x)~N(0,1)
q(z|x)~g(x)
- reparameterize(mu, log_var)[source]
- Parameters:
mu – mean from the encoder’s latent space
log_var – log variance from the encoder’s latent space
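The parameter names suggest the standard VAE reparameterization trick, z = mu + sigma * eps with eps ~ N(0, 1). The sketch below shows that trick in plain Python under this assumption; the model itself is a torch Module and its actual implementation is not shown in this reference.

```python
import math
import random


def reparameterize(mu, log_var, rng=random):
    """Draw z = mu + sigma * eps elementwise, with eps ~ N(0, 1)."""
    z = []
    for m, lv in zip(mu, log_var):
        sigma = math.exp(0.5 * lv)  # log-variance -> standard deviation
        eps = rng.gauss(0.0, 1.0)   # noise is sampled independently of mu, sigma
        z.append(m + sigma * eps)
    return z


rng = random.Random(0)
# Mean 2.0 and variance 4.0 (log_var = log 4) for a single latent dimension
samples = [reparameterize([2.0], [math.log(4.0)], rng)[0] for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
print(round(sample_mean, 2))
```

Because the randomness enters only through eps, gradients can flow through mu and log_var during training.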
- training: bool
- class starfysh.starfysh.AVAE_PoE(adata, gene_sig, patch_r, win_loglib)[source]
Bases: Module
- Model design:
p(x|z)=f(z)
p(z|x)~N(0,1)
q(z|x)~g(x)
- generative(inference_outputs, xs_k, img_path_outputs)[source]
- Parameters:
xs_k (torch.Tensor) – Z-normed average gene expression
- get_loss(generative_outputs, inference_outputs, img_path_outputs, poe_path_outputs, x, x_peri, library, adata_img, device)[source]
- reparameterize(mu, log_var)[source]
- Parameters:
mu – mean from the encoder’s latent space
log_var – log variance from the encoder’s latent space
- training: bool
- class starfysh.starfysh.NegBinom(mu, theta, eps=1e-10)[source]
Bases: Distribution
Gamma-Poisson mixture approximation of Negative Binomial(mean, dispersion)
lambda ~ Gamma(mu, theta)
x ~ Poisson(lambda)
- arg_constraints = {'mu': GreaterThanEq(lower_bound=0), 'theta': GreaterThanEq(lower_bound=0)}
- sample()[source]
Generates a sample_shape shaped sample or sample_shape shaped batch of samples if the distribution parameters are batched.
- support = IntegerGreaterThan(lower_bound=0)
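The Gamma-Poisson construction can be sketched with the standard library alone. The parameterization below is an assumption: the rate is drawn from a Gamma with shape theta and scale mu / theta, so the resulting counts have mean mu and variance mu + mu^2 / theta. The function names are hypothetical and the class's own sample() may differ.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's multiplication method; suitable for small rates only."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def sample_neg_binom(mu, theta, rng):
    """Draw lambda ~ Gamma(shape=theta, scale=mu/theta), then x ~ Poisson(lambda)."""
    lam = rng.gammavariate(theta, mu / theta)
    return sample_poisson(lam, rng)


rng = random.Random(0)
draws = [sample_neg_binom(5.0, 2.0, rng) for _ in range(20000)]
mean = sum(draws) / len(draws)
var = sum((x - mean) ** 2 for x in draws) / len(draws)
print(round(mean, 2), round(var, 2))
```

The sample variance exceeds the sample mean, which is the overdispersion that motivates a negative binomial over a plain Poisson for count data.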
- starfysh.starfysh.model_ct_exp(model, adata, visium_args, poe=False, device=device(type='cpu'))[source]
Obtain predicted cell-type specific expression in each spot
starfysh.utils module
- class starfysh.utils.VisiumArguments(adata, adata_norm, gene_sig, img_metadata, **kwargs)[source]
Bases: object
Loads Visium AnnData and performs preprocessing, library-size smoothing & anchor spot detection
- Parameters:
adata (AnnData) – annotated visium count matrix
adata_norm (AnnData) – annotated visium count matrix after normalization & log-transform
gene_sig (pd.DataFrame) – list of signature genes for each cell type. (dim: [S, Cell_type])
img_metadata (dict) – Spatial information metadata (histology image, coordinates, scalefactor)
- starfysh.utils.append_sigs(gene_sig, factor, sigs, n_genes=5)[source]
Append list of genes to a given cell type as additional signatures or add novel cell type / states & their signatures
- starfysh.utils.extract_feature(adata, key)[source]
Extract generative / inference output from adata.obsm and generate a temporary dummy adata for plotting
- starfysh.utils.get_adata_wsig(adata, adata_norm, gene_sig)[source]
Select intersection of HVGs from dataset & signature annotations
- starfysh.utils.get_alpha_min(sig_mean, pure_dict)[source]
Calculate alpha_min for Dirichlet dist. for each factor
- starfysh.utils.get_anchor_spots(adata_sample, sig_mean, v_low=20, v_high=95, n_anchor=40)[source]
Calculate the top anchor spots enriched for the given cell type (determined by normalized expression values from each signature)
- Parameters:
adata_sample (sc.AnnData) – ST raw count
v_low (int) – the low threshold to filter high-quality spots
v_high (int) – the high threshold to filter high-quality spots
n_anchor (int) – # anchor spots per cell type
- Returns:
pure_spots (np.ndarray) – anchor spot indices per cell type (dim: [S, n_anchor])
pure_dict (dict) – Cell-type -> Anchor spots
adata_pure (np.ndarray) – Binary indicators of anchor spots (dim: [S, n_anchor])
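The selection logic can be pictured as a quality filter followed by a per-cell-type ranking. In the sketch below, the filter keeps spots whose library size lies between the v_low and v_high percentiles; this choice of filter, the helper name, and the toy data are assumptions, since the reference only says the thresholds filter high-quality spots.

```python
import statistics


def top_anchor_spots(sig_scores, lib_sizes, v_low=20, v_high=95, n_anchor=2):
    """Pick the n_anchor highest-scoring spots per cell type among kept spots."""
    # 99 percentile cut points; index v - 1 is the v-th percentile
    cuts = statistics.quantiles(lib_sizes, n=100, method="inclusive")
    low, high = cuts[v_low - 1], cuts[v_high - 1]
    keep = [i for i, size in enumerate(lib_sizes) if low <= size <= high]

    anchors = {}
    for cell_type, scores in sig_scores.items():
        ranked = sorted(keep, key=lambda i: scores[i], reverse=True)
        anchors[cell_type] = ranked[:n_anchor]
    return anchors


# Hypothetical library sizes and signature scores for 10 spots
lib_sizes = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
sig_scores = {
    "T cell": [9.0, 0.1, 0.2, 5.0, 0.3, 4.0, 0.1, 0.2, 0.1, 8.0],
    "B cell": [0.1, 0.0, 0.2, 0.1, 3.0, 0.2, 2.5, 0.1, 0.3, 0.0],
}
anchors = top_anchor_spots(sig_scores, lib_sizes)
print(anchors)
```

Spots 0 and 9 score highest for "T cell" in this toy data but are dropped by the filter, which shows why the thresholds matter before ranking.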
- starfysh.utils.load_adata(data_folder, sample_id, n_genes, multiple_data=False)[source]
Load Visium adata with raw counts, preprocess it & extract highly variable genes
- Parameters:
data_folder (str) – Root directory of the data
sample_id (str) – Sample subdirectory under data_folder
n_genes (int) – Number of genes used for training
multiple_data (bool) – Whether the study includes multiple datasets
- Returns:
adata (sc.AnnData) – Processed ST raw counts
adata_norm (sc.AnnData) – Processed ST normalized & log-transformed data
- starfysh.utils.load_signatures(filename, adata)[source]
Load annotated signature gene sets
- Parameters:
filename (str) – Signature file
adata (sc.AnnData) – ST count matrix
- Returns:
gene_sig – signatures per cell type / state
- Return type:
pd.DataFrame
- starfysh.utils.preprocess(adata_raw, lognorm=True, min_perc=None, max_perc=None, n_top_genes=6000, mt_thld=100, verbose=True, multiple_data=False)[source]
Preprocess the ST gexp matrix and remove ribosomal & mitochondrial genes
- Parameters:
adata_raw (AnnData) – Spot x Gene raw expression matrix [S x G]
min_perc (float) – lower-bound percentile of non-zero gexps for filtering spots
max_perc (float) – upper-bound percentile of non-zero gexps for filtering spots
n_top_genes (float) – number of highly variable genes to keep
mt_thld (float) – max. percentage of mitochondrial gexps, used to filter out spots with excessive MT expression
multiple_data (bool) – whether the study needs to integrate multiple datasets
- starfysh.utils.preprocess_img(data_path, sample_id, adata_index, hchannel=False)[source]
Load and preprocess visium paired H&E image & spatial coords
- Parameters:
data_path (str) – Root directory of the data
sample_id (str) – Sample subdirectory under data_path
hchannel (bool) – Whether to apply binary color deconvolution to extract the hematoxylin channel. Please refer to: https://digitalslidearchive.github.io/HistomicsTK/examples/color_deconvolution.html
- Returns:
adata_image (np.ndarray) – Processed histology image
map_info (np.ndarray) – Spatial coords of spots (dim: [S, 2])
- starfysh.utils.refine_anchors(visium_args, aa_model, thld=0.35, n_genes=5, n_iters=1)[source]
Refine anchor spots & marker genes with archetypal analysis. DEGs computed from archetypes are appended to their best-matched anchors, and new anchor spots are then re-computed.
- Parameters:
visium_args (VisiumArguments) – Default parameter set for Starfysh upon dataloading
aa_model (ArchetypalAnalysis) – Pre-computed archetype object
thld (float) – Threshold cutoff for anchor-archetype mapping
n_genes (int) – # archetypal marker genes to append per refinement iteration
- Returns:
visium_args – updated parameter set for Starfysh
- Return type:
VisiumArguments
- starfysh.utils.run_starfysh(visium_args, n_repeats=3, lr=0.001, epochs=100, patience=10, poe=False, device=device(type='cpu'), verbose=True)[source]
Wrapper to run starfysh deconvolution.
Note: adds an early-stopping mechanism evaluated by loss c. By default, training stops early with patience=10 and the best model among 3 reruns is chosen.
- Parameters:
visium_args (VisiumArguments) – Preprocessed metadata calculated from input visium matrix: e.g. mean signature expression, library size, anchor spots, etc.
n_repeats (int) – Number of restarts to run Starfysh
epochs (int) – Max. number of iterations
patience (int) – Max. number of epochs to wait before early stopping if q(c) doesn’t drop
poe (bool) – Whether to perform inference with PoE w/ image integration
- Returns:
best_model (starfysh.AVAE or starfysh.AVAE_PoE) – Trained Starfysh model with deconvolution results
loss (np.ndarray) – Training losses
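The restart-plus-early-stopping pattern described above can be sketched without any model at all. In this hypothetical example, precomputed loss sequences stand in for real training runs; the function names and the stopping rule (stop after `patience` consecutive epochs without improvement) are assumptions for illustration.

```python
def train_with_early_stopping(losses, patience=10):
    """Scan per-epoch losses; stop once `patience` epochs pass without improvement."""
    best, wait = float("inf"), 0
    history = []
    for loss in losses:
        history.append(loss)
        if loss < best:
            best, wait = loss, 0  # improvement resets the counter
        else:
            wait += 1
            if wait >= patience:
                break
    return best, history


def best_of_repeats(runs, patience=10):
    """Run every restart and keep the one with the lowest best loss."""
    results = [train_with_early_stopping(run, patience) for run in runs]
    best_idx = min(range(len(results)), key=lambda i: results[i][0])
    return best_idx, results[best_idx]


# Three stand-in restarts; the first would improve later but stops early
runs = [
    [5.0, 4.0, 3.5, 3.6, 3.7, 3.8, 1.0],
    [6.0, 5.0, 2.0, 2.1, 2.2],
    [4.0, 3.9, 3.8],
]
best_idx, (best_loss, history) = best_of_repeats(runs, patience=3)
print(best_idx, best_loss, len(history))
```

The first run illustrates the trade-off of a small patience value: training halts before a later improvement is ever seen.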