netflow.probe.summary
This module provides methods for describing the branches, clusters, and co-localization that result on the POSE.
Functions
| enrichr | Performs gene set enrichment via gseapy.enrichr with detailed feature sizes. |
| extract_features | Extract subset of samples (i.e., rows) based on threshold. |
| feature_graph_order_correlation_global | Compute correlation between features and global node ordering. |
| feature_graph_order_correlation_local | Compute correlation between features and branch node ordering. |
| get_branch_node_order | Order observations on each branch by distance from the branch tip nearest to the root. |
| get_global_node_order | Order nodes by (weighted) graph distance from the root. |
| get_graph_distances | Compute distances between all node pairs on a graph. |
| get_gsea_library_names | Get list of gene libraries for GSEA analysis using gseapy. |
| gsea_group_summary | Perform GSEA summary of significant genes for each group. |
| ordered_features_correlation_branch | Compute correlations between pairs of features on each branch. |
| ordered_features_correlation_global | Compute correlation between feature pairs, sorted by global node order. |
| unique_significant_features_by_group | Identify features that are significant in exactly one group and not significant in any other group. |
- netflow.probe.summary.enrichr(**kwargs)
Performs gene set enrichment via gseapy.enrichr with detailed feature sizes.
- Parameters:
kwargs (dict) – Keyword arguments passed to gseapy.enrichr.
- Returns:
results – Dataframe with results returned from gseapy.enrichr analysis, with additional “Overlap” information columns:
“n” : The number of genes in the provided gene list that overlap with the library.
“N” : The library size (number of genes in the library).
“n/N” : The ratio of the overlap (provided genes that are in the library) to the library size.
The following columns returned by gseapy.enrichr are dropped:
“Old P-value”
“Old Adjusted P-value”
- Return type:
pd.DataFrame
Examples
>>> libraries = ['GO_Biological_Process_2025',
...              'GO_Cellular_Component_2025',
...              'GO_Molecular_Function_2025',
...              'Human_Phenotype_Ontology',
...              'MSigDB_Hallmark_2020',
...              'MSigDB_Oncogenic_Signatures',
...              'Reactome_Pathways_2024',
...              ]
>>> G = networkx.Graph(HPRD)  # A gene-gene interaction graph based on HPRD
>>> gl = ['TOP2A', 'FEN1', 'BLM', 'MCM7', 'MCM8', 'XPC', 'MCM10', 'BRCA1', 'TTF2', ...]
>>> enrichr(gene_list=gl,  # or "./tests/data/gene_list.txt"
...         gene_sets=libraries,  # ['MSigDB_Hallmark_2020','KEGG_2021_Human']
...         organism='human',
...         background=list(G),  # or "hsapiens_gene_ensembl", or int, or text file, or a list of genes
...         outdir=None,  # if None, don't write to disk
...         )
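The “Overlap” columns described above can be illustrated with a small pandas sketch. It assumes the raw results carry an “Overlap” column of "n/N" strings, as gseapy.enrichr output typically does; the helper name add_overlap_columns and the toy table are hypothetical and are not netflow's actual implementation.

```python
import pandas as pd


def add_overlap_columns(results: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: split 'n/N' overlap strings into numeric columns."""
    out = results.copy()
    # "4/120" -> two integer columns (overlap count, library size)
    parts = out["Overlap"].str.split("/", expand=True).astype(int)
    out["n"] = parts[0]
    out["N"] = parts[1]
    out["n/N"] = out["n"] / out["N"]
    # Drop the legacy columns if they are present
    return out.drop(columns=["Old P-value", "Old Adjusted P-value"], errors="ignore")


# Toy results table with made-up values, for illustration only
toy = pd.DataFrame({
    "Term": ["DNA replication", "DNA repair"],
    "Overlap": ["4/120", "3/60"],
    "P-value": [1e-4, 2e-3],
    "Old P-value": [0, 0],
    "Old Adjusted P-value": [0, 0],
})
annotated = add_overlap_columns(toy)
print(annotated[["Term", "n", "N", "n/N"]])
```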
- netflow.probe.summary.extract_features( ) → List[str]
Extract subset of samples (i.e., rows) based on threshold.
- Parameters:
df (pd.DataFrame) – The data, expected to have rows as samples with columns as features on which to threshold.
metric (str) – The column header to use as metric for determining samples to keep.
threshold ({float, str}) –
Threshold applied to metric, depends on type:
float : If apply_lt = True, a sample is kept if metric_value <= threshold. If apply_lt = False, a sample is kept if metric_value >= threshold.
str : A sample is kept if metric_value == threshold.
apply_lt (bool, default = True) – Indicate how to compare numeric metric values to the threshold. If True, retain metric values that are less than or equal to (lt) threshold. Otherwise, retain values greater than or equal to threshold. Ignored if threshold is a str.
- Returns:
selected – The list of sample index labels that satisfy the threshold.
- Return type:
List
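The thresholding rules above can be sketched as follows. This is a minimal illustration written from the parameter descriptions, not the library's source; the function name extract_features_sketch and the toy data are hypothetical.

```python
from typing import List, Union

import pandas as pd


def extract_features_sketch(
    df: pd.DataFrame,
    metric: str,
    threshold: Union[float, str],
    apply_lt: bool = True,
) -> List:
    """Hypothetical re-statement of the documented thresholding rules."""
    values = df[metric]
    if isinstance(threshold, str):
        mask = values == threshold   # str threshold: exact match, apply_lt ignored
    elif apply_lt:
        mask = values <= threshold   # keep small metric values
    else:
        mask = values >= threshold   # keep large metric values
    return df.index[mask].tolist()


toy = pd.DataFrame(
    {"score": [0.1, 0.5, 0.9], "label": ["a", "b", "a"]},
    index=["s1", "s2", "s3"],
)
low = extract_features_sketch(toy, "score", 0.5)
high = extract_features_sketch(toy, "score", 0.5, apply_lt=False)
labelled = extract_features_sketch(toy, "label", "a")
```

Note that the boundary value (0.5 here) is kept in both numeric modes, because both comparisons are inclusive.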
- netflow.probe.summary.feature_graph_order_correlation_global(poser, graph_nw, data_df, obs_labels, weights=None)
Compute correlation between features and global node ordering.
Parameters:
- poser: netflow.pose.POSER
The object used to construct the POSE.
- graph_nw: networkx.Graph
The POSE graph.
- data_df: pandas.DataFrame (n_features, n_observations)
Feature matrix.
- obs_labels: list of str
List of observation labels corresponding to node IDs.
- weights: {None, pandas.DataFrame}
Dataframe of edge weights between nodes (observations) used to compute weighted hop distance if provided. Unweighted hop count is used if None.
Returns:
- corr_df: pandas.DataFrame
Dataframe containing the Spearman correlation between features and the global ordering of nodes, and the associated p-values.
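As a rough illustration of what such a correlation table might look like, the sketch below correlates each feature with a precomputed node ordering using scipy.stats.spearmanr. It assumes the ordering is already available as a pandas Series indexed by observation label; the function name feature_order_correlation and the output column names rho and pval are hypothetical and may differ from what netflow returns.

```python
import pandas as pd
from scipy.stats import spearmanr


def feature_order_correlation(data_df: pd.DataFrame, node_order: pd.Series) -> pd.DataFrame:
    """Hypothetical sketch: Spearman correlation of each feature with a node ordering.

    data_df    : (n_features, n_observations) feature matrix
    node_order : ordering value per observation, indexed by observation label
    """
    # Align the ordering with the observation columns of the feature matrix
    order = node_order.reindex(data_df.columns)
    rows = []
    for feature, values in data_df.iterrows():
        rho, pval = spearmanr(values.to_numpy(), order.to_numpy())
        rows.append({"feature": feature, "rho": rho, "pval": pval})
    return pd.DataFrame(rows).set_index("feature")


# Toy data: one feature increases along the ordering, one decreases
order = pd.Series({"a": 0, "b": 1, "c": 2, "d": 3})
data = pd.DataFrame(
    {"a": [1.0, 9.0], "b": [2.0, 7.0], "c": [4.0, 3.0], "d": [8.0, 1.0]},
    index=["up", "down"],
)
corr_df = feature_order_correlation(data, order)
```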
- netflow.probe.summary.feature_graph_order_correlation_local(poser, graph_nw, data_df, obs_labels, weights=None, min_branch_size=None)
Compute correlation between features and branch node ordering.
Parameters:
- poser: netflow.pose.POSER
The object used to construct the POSE.
- graph_nw: networkx.Graph
The POSE graph.
- data_df: pandas.DataFrame (n_features, n_observations)
Feature matrix.
- obs_labels: list of str
List of observation labels corresponding to node IDs.
- weights: {None, pandas.DataFrame, (n, n)}
Dataframe of edge weights between nodes (observations) used to compute weighted hop distance if provided. Unweighted hop count is used if None.
- min_branch_size: {None, int}
Skip branches with <= min_branch_size observations.
Returns:
- corr_dict: dict of pandas.DataFrame
Dictionary mapping branch IDs to a pd.DataFrame. Each dataframe contains the Spearman correlation between features and the ordering of nodes on that branch, and the associated p-values.
- netflow.probe.summary.get_branch_node_order(poser, graph_nw, weights=None, min_branch_size=3)
Order observations on each branch by distance from the branch tip nearest to the root.
Parameters:
- poser: netflow.pose.POSER
The object used to construct the POSE.
- graph_nw: networkx.Graph
The POSE graph.
- weights: {None, pandas.DataFrame, (n, n)}
Dataframe of edge weights between nodes (observations). If None unweighted hop count is used.
- min_branch_size: {None, int}
Skip branches with <= min_branch_size observations.
Returns:
- branch_ord_dict: dict
Dictionary mapping branch node indices to their ordering by graph distance from the branch tip nearest the root. Weighted distance is used if weights is provided; otherwise, hop distance is used.
- netflow.probe.summary.get_global_node_order(poser, graph_nw, weights=None)
Order nodes by (weighted) graph distance from the root.
Parameters:
- poser: netflow.pose.POSER
The object used to construct the POSE, containing the root node ID.
- graph_nw: networkx.Graph
Input POSE topology.
- weights: {None, pandas.DataFrame, (n, n)}
Dataframe of edge weights between nodes (observations). If None unweighted hop count is used.
Returns:
- node_ord_dict: dict
Dictionary mapping node indices to their ordering by graph distance from the root. Weighted distance is used if weights is provided; otherwise, hop distance is used.
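One way to picture ordering by distance from the root is the networkx sketch below. It is written under several assumptions: the root is passed in directly rather than read from a POSER object, weights live on an edge attribute rather than in a DataFrame, and the ordering is a 0-based rank with ties broken by node ID. The function name global_node_order is hypothetical.

```python
import networkx as nx


def global_node_order(graph: nx.Graph, root, weight=None) -> dict:
    """Hypothetical sketch: rank nodes by graph distance from a root node.

    weight=None counts hops; otherwise the named edge attribute is used.
    """
    dist = nx.shortest_path_length(graph, source=root, weight=weight)
    # Sort by distance, breaking ties by node ID so the result is deterministic
    ranked = sorted(dist, key=lambda node: (dist[node], node))
    return {node: rank for rank, node in enumerate(ranked)}


# Toy graph: a short chain 0-1-2 and a long edge 0-3
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 1.0), (1, 2, 1.0), (0, 3, 5.0)])

hop_order = global_node_order(G, root=0)                        # node 3 is one hop away
weighted_order = global_node_order(G, root=0, weight="weight")  # node 3 is farthest
```

In the toy graph the two orderings differ for node 3: it is adjacent to the root by hop count but farthest away once the edge weights are taken into account.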
- netflow.probe.summary.get_graph_distances(graph_nw, weights=None)
Compute distances between all node pairs on a graph.
Parameters:
- graph_nw: networkx.Graph
Input graph.
- weights: {None, pandas.DataFrame, (n, n)}
Dataframe of edge weights between nodes (observations). If None unweighted hop count is used.
Returns:
- graph_dist: numpy.ndarray, (n, n)
Matrix of pairwise graph distances between nodes ordered 0,1,…,n-1. Weighted distance is returned if weights is provided; otherwise, hop distance is returned.
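The pairwise distance matrix can be sketched with networkx as shown below. This is an illustration only, assuming nodes are labelled 0,1,…,n-1 and that weights, when given, is an (n, n) DataFrame indexed by those labels; the function name graph_distances and the temporary edge attribute "w" are hypothetical.

```python
import networkx as nx
import numpy as np
import pandas as pd


def graph_distances(graph: nx.Graph, weights: pd.DataFrame = None) -> np.ndarray:
    """Hypothetical sketch: all-pairs shortest-path distances as an (n, n) array."""
    g = graph.copy()
    n = g.number_of_nodes()
    # Store the edge length on a temporary attribute: 1 per hop, or the given weight
    for u, v in g.edges():
        g[u][v]["w"] = 1.0 if weights is None else float(weights.loc[u, v])
    dist = np.full((n, n), np.inf)  # unreachable pairs stay at infinity
    for source, lengths in nx.all_pairs_dijkstra_path_length(g, weight="w"):
        for target, d in lengths.items():
            dist[source, target] = d
    return dist


# Toy path graph 0-1-2 with edge weights 2 and 3
G = nx.path_graph(3)
W = pd.DataFrame([[0, 2, 0], [2, 0, 3], [0, 3, 0]])

hops = graph_distances(G)         # hop counts
weighted = graph_distances(G, W)  # weighted distances
```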
- netflow.probe.summary.get_gsea_library_names(organism='Human')
Get list of gene libraries for GSEA analysis using gseapy.
This returns active enrichr library names that can be found at: https://maayanlab.cloud/modEnrichr/.
- Parameters:
organism ({'Human', 'Mouse', 'Yeast', 'Fly', 'Fish', 'Worm'}, default='Human') – The database to pull libraries from.
- Returns:
libraries – The list of library names.
- Return type:
list
- netflow.probe.summary.gsea_group_summary(records_full: Dict[Any, DataFrame] | DataFrame, metric: Literal['pval', 'qval'] = 'qval', threshold: float = 0.05, dropna: bool = True, **kwargs)
Perform GSEA summary of significant genes for each group.
This function consumes the output of netflow.methods.stats.sig_feats_by_group() in either of its supported formats:
Dictionary format:
records_full[group] -> DataFrame(index=features, columns include metric), where metric is either “pval” (p-value) or “qval” (corrected p-value).
Wide DataFrame format:
A DataFrame indexed by features with MultiIndex columns of the form (group, metric_name), where metric_name includes “pval” and/or “qval”.
The output is a dictionary mapping each group to a pd.DataFrame holding the GSEA output that enrichr() returns for that group's significant genes.
- Parameters:
records_full (dict or pandas.DataFrame) –
Results returned by sig_feats_by_group().
If dict: keys are group labels; values are per-group DataFrames.
If DataFrame: must have MultiIndex columns with level 0 = group and level 1 = metric.
metric ({"pval", "qval"}, default="qval") –
Which significance column to use when deciding whether a feature is significant.
“pval”: raw p-values
“qval”: multiple-testing corrected p-values (q-values)
threshold (float, default=0.05) – Significance threshold applied to metric. A feature is considered significant for a given group if metric_value <= threshold.
dropna (bool, default=True) – If True, treat NaN metric values as “not significant”. If False, NaNs propagate only through comparisons (NaN <= threshold is False anyway), so behavior is effectively the same for the significance mask; this mainly affects internal bookkeeping.
kwargs (dict) – Optional keyword arguments passed to enrichr().
- Returns:
summary – Dictionary keyed by group label, where values are dataframes of gsea results.
- Return type:
dict
Notes
If you ran netflow.methods.stats.sig_feats_by_group(…, top_n=…), then records_full may only include a subset of features per group. In that case, “unique” is evaluated relative to what is present in records_full (missing features are treated as not significant for that group). For strict uniqueness across all tested features, generate results without top_n. It is therefore recommended and expected that results were generated without top_n.
If netflow.methods.stats.sig_feats_by_group() used the two-category optimization (compute only one direction), then records_full may not contain both groups; uniqueness can only be assessed across the groups that are present in records_full.
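To make the two accepted input formats concrete, the sketch below converts the wide MultiIndex layout into the per-group dictionary layout and then selects each group's significant features. It is a hedged illustration: the helper names to_group_dict and significant_features, the group labels, and the toy values are all hypothetical, and the actual function may handle these steps differently.

```python
import pandas as pd


def to_group_dict(records_full):
    """Hypothetical helper: normalize either input format to {group: DataFrame}."""
    if isinstance(records_full, dict):
        return records_full
    groups = records_full.columns.get_level_values(0).unique()
    # xs drops the group level, leaving plain metric columns per group
    return {g: records_full.xs(g, axis=1, level=0) for g in groups}


def significant_features(records_full, metric="qval", threshold=0.05):
    """Features with metric_value <= threshold, per group (NaN compares as False)."""
    per_group = to_group_dict(records_full)
    return {g: df.index[df[metric] <= threshold].tolist() for g, df in per_group.items()}


# Toy wide-format table: columns are (group, metric_name) pairs
wide = pd.DataFrame(
    [[0.001, 0.01, 0.2, 0.4],
     [0.3, 0.5, 0.002, 0.03],
     [0.004, 0.04, 0.003, 0.02]],
    index=["TOP2A", "FEN1", "BRCA1"],
    columns=pd.MultiIndex.from_product([["A", "B"], ["pval", "qval"]]),
)
sig = significant_features(wide)
```

Each per-group list could then be supplied as gene_list to enrichr(), which is presumably how the per-group GSEA tables are obtained.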
- netflow.probe.summary.ordered_features_correlation_branch(poser, graph_nw, data_df, obs_labels, weights=None, min_branch_size=3)
Compute correlations between pairs of features on each branch.
Parameters:
- poser: netflow.pose.POSER
The object used to construct the POSE.
- graph_nw: networkx.Graph
The POSE graph.
- data_df: pandas.DataFrame (n_features, n_observations)
Feature matrix.
- obs_labels: list of str
List of observation labels corresponding to node IDs.
- weights: {None, pandas.DataFrame, (n, n)}
Dataframe of edge weights between nodes (observations). If None unweighted hop count is used.
- min_branch_size: {None, int}
Skip branches with <= min_branch_size observations.
Returns:
- corr_dict: dict
Dictionary of correlations between feature pairs sorted by node order on each branch. Node order is based on weighted distance if weights is provided; otherwise, it is based on hop distance.
- netflow.probe.summary.ordered_features_correlation_global(poser, graph_nw, data_df, obs_labels, weights=None)
Compute correlation between feature pairs, sorted by global node order.
Parameters:
- poser: netflow.pose.POSER
The object used to construct the POSE.
- graph_nw: networkx.Graph
The POSE graph.
- data_df: pandas.DataFrame (n_features, n_observations)
Feature matrix.
- obs_labels: list of str
List of observation labels corresponding to node IDs.
- weights: {None, pandas.DataFrame, (n, n)}
Dataframe of edge weights between nodes (observations). If None unweighted hop count is used.
Returns:
Correlations between feature pairs, sorted by global node order. Node order is based on weighted distance if weights is provided; otherwise, it is based on hop distance.
- netflow.probe.summary.unique_significant_features_by_group(records_full: Dict[Any, DataFrame] | DataFrame, *, metric: Literal['pval', 'qval'] = 'qval', threshold: float = 0.05, dropna: bool = True, sort_within_group: bool = True)
Identify features that are significant in exactly one group and not significant in any other group.
This function consumes the output of netflow.methods.stats.sig_feats_by_group() in either of its supported formats:
Dictionary format:
records_full[group] -> DataFrame(index=features, columns include metric), where metric is either “pval” (p-value) or “qval” (corrected p-value).
Wide DataFrame format:
A DataFrame indexed by features with MultiIndex columns of the form (group, metric_name), where metric_name includes “pval” and/or “qval”.
The output is a dictionary mapping each group to a list of features that satisfy both conditions:
the feature is significant for that group according to the chosen metric and threshold, and
the feature is not significant for any other group in records_full.
- Parameters:
records_full (dict or pandas.DataFrame) –
Results returned by sig_feats_by_group().
If dict: keys are group labels; values are per-group DataFrames.
If DataFrame: must have MultiIndex columns with level 0 = group and level 1 = metric.
metric ({"pval", "qval"}, default="qval") –
Which significance column to use when deciding whether a feature is significant.
“pval”: raw p-values
“qval”: multiple-testing corrected p-values (q-values)
threshold (float, default=0.05) – Significance threshold applied to metric. A feature is considered significant for a given group if metric_value <= threshold.
dropna (bool, default=True) – If True, treat NaN metric values as “not significant”. If False, NaNs propagate only through comparisons (NaN <= threshold is False anyway), so behavior is effectively the same for the significance mask; this mainly affects internal bookkeeping.
sort_within_group (bool, default=True) – If True, features returned per group are sorted by the chosen metric ascending (most significant first). If False, features follow the underlying feature index order.
- Returns:
unique_feats – Dictionary keyed by group label, where values are lists of feature names that are significant only for that group and for no other group.
- Return type:
dict
- Raises:
TypeError – If records_full is neither a dict nor a DataFrame.
ValueError – If metric is not present in the provided results, or if a wide DataFrame does not have the expected MultiIndex column structure.
Notes
If you ran netflow.methods.stats.sig_feats_by_group(…, top_n=…), then records_full may only include a subset of features per group. In that case, “unique” is evaluated relative to what is present in records_full (missing features are treated as not significant for that group). For strict uniqueness across all tested features, generate results without top_n. It is therefore recommended and expected that results were generated without top_n.
If netflow.methods.stats.sig_feats_by_group() used the two-category optimization (compute only one direction), then records_full may not contain both groups; uniqueness can only be assessed across the groups that are present in records_full.
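The uniqueness rule can be sketched for the dictionary input format as follows. This is a minimal illustration written from the description above, not the library's source: the function name unique_significant_sketch and the toy q-values are hypothetical, feature indices are assumed to be unique, and the dropna handling and wide-format input are left out.

```python
import pandas as pd


def unique_significant_sketch(records, metric="qval", threshold=0.05, sort_within_group=True):
    """Hypothetical sketch: features significant in exactly one group."""
    # Significant feature set per group (NaN <= threshold evaluates to False)
    sig = {g: set(df.index[df[metric] <= threshold]) for g, df in records.items()}
    unique = {}
    for g, df in records.items():
        # Everything that is significant in at least one *other* group
        others = set().union(*(s for h, s in sig.items() if h != g))
        keep = df.loc[[f for f in df.index if f in sig[g] and f not in others], metric]
        if sort_within_group:
            keep = keep.sort_values(kind="stable")  # most significant first
        unique[g] = keep.index.tolist()
    return unique


features = ["TOP2A", "FEN1", "BRCA1", "MCM7"]
records = {
    "A": pd.DataFrame({"qval": [0.01, 0.5, 0.04, 0.001]}, index=features),
    "B": pd.DataFrame({"qval": [0.4, 0.03, 0.02, 0.9]}, index=features),
}
unique = unique_significant_sketch(records)
unsorted = unique_significant_sketch(records, sort_within_group=False)
```

BRCA1 is significant in both toy groups, so it is reported for neither.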