netflow.methods.stats#
Functions
compute_pearson | Compute Pearson correlation with each row in Y for a given row in X.
compute_pearson_parallel | Compute Pearson correlation between each row of X and all rows of Y in parallel.
compute_spearman | Compute Spearman correlation with each row in Y for a given row in X.
compute_spearman_parallel | Compute Spearman correlation between each row of X and all rows of Y in parallel.
mann_whitney_u_test | Perform the Mann-Whitney U rank test on two independent samples.
perform_stat_test | Choose and perform a statistical test, matching your original dispatch behavior.
perform_stat_test_matrix | Compute per-feature p-values comparing two groups of samples.
sig_feats_by_group | Compute per-group feature significance vs the rest of the cohort.
stat_test | Perform statistical test between groups/datasets and apply multiple test correction.
t_test | Calculate the T-test for the means of two independent samples of scores.
wilcoxon_signed_rank_test | The Wilcoxon signed-rank test.
- netflow.methods.stats._apply_multipletests_safe( ) ndarray[source]#
Apply multiple testing correction with stable behavior in presence of NaNs.
- Parameters:
- Returns:
qvals – Corrected p-values. NaN p-values remain NaN in qvals.
- Return type:
numpy.ndarray, shape (n_features,)
Notes
statsmodels.stats.multitest.multipletests does not always behave well with NaNs. This helper replaces NaNs with 1.0 for correction, then restores NaNs.
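The replace-then-restore pattern described in the Notes can be sketched as follows. This is only an illustration under stated assumptions: `bh_adjust` and `correct_with_nans` are hypothetical names, and a small NumPy Benjamini-Hochberg routine stands in for `statsmodels.stats.multitest.multipletests` so the block has no extra dependency.

```python
import numpy as np


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values for a 1-D array without NaNs.

    Hypothetical stand-in for statsmodels' multipletests(method="fdr_bh").
    """
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity starting from the largest p-value, then cap at 1.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(m, dtype=float)
    out[order] = np.clip(ranked, 0.0, 1.0)
    return out


def correct_with_nans(pvals):
    """Replace NaNs with 1.0 for the correction, then restore them."""
    p = np.asarray(pvals, dtype=float)
    nan_mask = np.isnan(p)
    filled = np.where(nan_mask, 1.0, p)
    q = bh_adjust(filled)
    q[nan_mask] = np.nan
    return q


pvals = np.array([0.01, np.nan, 0.04, 0.20])
qvals = correct_with_nans(pvals)
```

Note that in this sketch the NaN entries still count toward the number of tests, because they take part in the correction as p = 1.0.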
- netflow.methods.stats._coerce_samples_x_features(
- groups: Series,
- feats: DataFrame,
- *,
- samples_axis: Literal['auto', 'index', 'columns'] = 'auto',
Coerce and align inputs so feature data feats is shaped as (n_samples, n_features).
This helper ensures all downstream computations can assume:
X_df is a DataFrame with samples as rows and features as columns
X_df.index matches groups.index exactly (same sample IDs, same order)
It supports two common input conventions:
feats provided as samples x features (samples on index)
feats provided as features x samples (samples on columns), in which case it will be transposed.
- Parameters:
groups (pandas.Series) – Group labels indexed by sample ID. The sample IDs are used to align groups and feats.
feats (pandas.DataFrame) – Feature matrix in one of two orientations:
samples x features (feats.index are sample IDs, feats.columns feature names)
features x samples (feats.columns are sample IDs, feats.index feature names)
samples_axis ({"auto", "index", "columns"}, default="auto") – Controls how alignment/orientation is determined.
"auto": infer orientation from whether groups.index matches feats.index or feats.columns.
"index": enforce that samples are stored on feats.index.
"columns": enforce that samples are stored on feats.columns.
- Returns:
groups_aligned (pandas.Series) – groups reindexed to match the sample order of the returned feature matrix.
X_df (pandas.DataFrame) – Feature matrix oriented as samples x features, with:
X_df.index = sample IDs (same order as groups_aligned.index)
X_df.columns = feature names
- Raises:
TypeError – If groups is not a Series or feats is not a DataFrame.
ValueError – If the function cannot determine a consistent alignment between groups and feats under the given samples_axis.
Notes
This function is intentionally strict: it does not silently drop samples. If indices do not align, it raises with a clear message rather than producing a subtly misaligned analysis.
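The orientation inference and strict alignment described above might look roughly like the sketch below. It is not the library's implementation: `coerce_samples_by_features` is a hypothetical name, and comparing sample IDs as sets is an assumption about how "matches" is decided.

```python
import pandas as pd


def coerce_samples_by_features(groups, feats, samples_axis="auto"):
    """Return (groups_aligned, X_df) with X_df shaped samples x features."""
    if not isinstance(groups, pd.Series) or not isinstance(feats, pd.DataFrame):
        raise TypeError("groups must be a Series and feats a DataFrame")

    ids = set(groups.index)
    on_index = ids == set(feats.index)
    on_columns = ids == set(feats.columns)

    if samples_axis == "auto":
        # Exactly one axis must carry the sample IDs, otherwise refuse to guess.
        if on_index == on_columns:
            raise ValueError("cannot infer orientation from sample IDs")
        samples_axis = "index" if on_index else "columns"

    if samples_axis == "index" and on_index:
        X_df = feats
    elif samples_axis == "columns" and on_columns:
        X_df = feats.T
    else:
        raise ValueError("sample IDs in groups do not match feats")

    # Reorder groups to follow the feature matrix; nothing is dropped.
    return groups.loc[X_df.index], X_df


groups = pd.Series(["a", "b", "a"], index=["s1", "s2", "s3"])
feats = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]], index=["f1", "f2"], columns=["s3", "s1", "s2"]
)  # features x samples
groups_aligned, X_df = coerce_samples_by_features(groups, feats)
```

Here the input is features x samples, so the sketch transposes it and reorders `groups` to the sample order of the transposed matrix.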
- netflow.methods.stats._coerce_two_dfs_samples_x_features(
- df1: DataFrame,
- df2: DataFrame,
- *,
- samples_axis: Literal['auto', 'index', 'columns'] = 'auto',
- test: Literal['MWU', 't-test', 'wilcoxon'] = 'MWU',
Coerce and strictly align two DataFrames so both are shaped as (n_samples, n_features).
- Parameters:
df1, df2 (pandas.DataFrame) – Two matrices to compare. Each may be either:
samples x features (samples on index), or
features x samples (samples on columns)
samples_axis ({"auto","index","columns"}, default="auto") –
"index": enforce samples on index (samples x features)
"columns": enforce samples on columns (features x samples), then transpose
"auto": infer strictly from label agreement between df1 and df2:
if df1.columns == df2.columns and df1.index != df2.index -> samples_axis="index"
if df1.index == df2.index and df1.columns != df2.columns -> samples_axis="columns"
if both index and columns match -> ambiguous; choose by heuristic (smaller dimension treated as samples); if tie, raise.
if neither matches -> raise.
test ({"MWU","t-test","wilcoxon"}, default="MWU") – If "wilcoxon", requires paired samples after coercion (same sample index).
- Returns:
X1, X2 – Coerced matrices as samples x features, strictly aligned on features. If test="wilcoxon", also strictly aligned on samples.
- Return type:
pandas.DataFrame
- Raises:
ValueError – If strict alignment is not possible under the requested/inferred orientation.
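The "auto" inference rules listed under samples_axis can be sketched as a small decision function. `infer_samples_axis` is a hypothetical name, and using `Index.equals` (same labels in the same order) as the notion of "match" is an assumption; the library may compare labels differently.

```python
import pandas as pd


def infer_samples_axis(df1, df2):
    """Infer which axis holds samples from label agreement between df1 and df2."""
    same_index = df1.index.equals(df2.index)
    same_columns = df1.columns.equals(df2.columns)

    if same_columns and not same_index:
        return "index"      # shared columns are features, so samples are rows
    if same_index and not same_columns:
        return "columns"    # shared index is features, so samples are columns
    if same_index and same_columns:
        n_rows, n_cols = df1.shape
        if n_rows == n_cols:
            raise ValueError("ambiguous orientation: both axes match and sizes tie")
        # Heuristic from the rules above: the smaller dimension is samples.
        return "index" if n_rows < n_cols else "columns"
    raise ValueError("neither index nor columns agree between df1 and df2")


df_a = pd.DataFrame(0.0, index=["s1", "s2"], columns=["f1", "f2", "f3"])
df_b = pd.DataFrame(0.0, index=["s3", "s4", "s5"], columns=["f1", "f2", "f3"])
axis_rows = infer_samples_axis(df_a, df_b)
axis_cols = infer_samples_axis(df_a.T, df_b.T)
```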
- netflow.methods.stats._feature_chunks( ) List[Tuple[int, int]][source]#
Partition a number of features into contiguous chunks (half-open intervals).
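As an illustration of contiguous half-open chunking, a minimal sketch is shown below. The function name and its two parameters are assumptions, since the signature of the helper is not shown here.

```python
from typing import List, Tuple


def feature_chunks(n_features: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split range(n_features) into contiguous [start, end) intervals."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    return [
        (start, min(start + chunk_size, n_features))
        for start in range(0, n_features, chunk_size)
    ]


chunks = feature_chunks(10, 4)  # [(0, 4), (4, 8), (8, 10)]
```

Because each interval excludes its end index, consecutive chunks share a boundary value without overlapping, and the last chunk is simply shorter when the sizes do not divide evenly.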
- netflow.methods.stats._mwu_mask_chunk( ) Tuple[int, ndarray][source]#
Compute MWU p-values for a chunk of feature columns given group/rest row indices.
- Parameters:
task (tuple) – (start, end, g_idx, r_idx) where:
start/end define the feature interval [start, end)
g_idx are the sample indices in the group
r_idx are the sample indices in the rest
- Returns:
start (int) – Start feature index.
pvals_chunk (numpy.ndarray) – MWU p-values for features in [start, end).
Notes
This avoids materializing (n_group x n_features) and (n_rest x n_features) matrices. It only gathers the 1D vectors needed per feature.
- netflow.methods.stats._mwu_mask_init_worker( ) None[source]#
Initializer for MWU group-vs-rest workers.
- Parameters:
X (numpy.ndarray, shape (n_samples, n_features)) – Full matrix.
alternative ({"two-sided","less","greater"}) – Alternative hypothesis passed to SciPy.
nan_policy ({"omit","raise"}) – NaN handling.
mwu_kwargs (dict) – Extra keyword args forwarded to scipy.stats.mannwhitneyu.
- Return type:
None
- netflow.methods.stats._mwu_pvals_group_vs_rest(
- X: ndarray,
- g_mask: ndarray,
- *,
- alternative: Literal['two-sided', 'less', 'greater'] = 'two-sided',
- nan_policy: Literal['omit', 'raise'] = 'omit',
- n_jobs: int = 1,
- parallel_backend: Literal['processes', 'threads', 'none'] = 'processes',
- chunk_size_features: int = 256,
- mwu_kwargs: Dict[str, Any] | None = None,
Compute MWU p-values for group vs rest without materializing submatrices.
- Parameters:
X (numpy.ndarray, shape (n_samples, n_features)) – Full matrix (samples x features).
g_mask (numpy.ndarray, dtype=bool, shape (n_samples,)) – Boolean mask selecting group samples.
alternative ({"two-sided","less","greater"}, default="two-sided") – Alternative hypothesis for MWU.
nan_policy ({"omit","raise"}, default="omit") – NaN handling.
n_jobs (int, default=1) – Parallel workers for MWU feature loop.
parallel_backend ({"processes","threads","none"}, default="processes") – Backend for parallel execution.
chunk_size_features (int, default=256) – Feature chunk size.
mwu_kwargs (dict, optional) – Extra keyword args forwarded to scipy.stats.mannwhitneyu.
- Returns:
pvals – MWU p-values per feature.
- Return type:
numpy.ndarray, shape (n_features,)
Notes
MWU is not vectorized across features in SciPy, so we compute it feature-wise. This implementation is memory efficient because it does not allocate X_group/X_rest.
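The feature-wise loop described in the Notes could be sketched as below, in a serial form without the parallel backend. `mwu_group_vs_rest` is a hypothetical name, and dropping NaNs per feature is an assumption about what nan_policy="omit" does here.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def mwu_group_vs_rest(X, g_mask, alternative="two-sided"):
    """Per-feature MWU p-values for group vs rest, one column at a time."""
    g_mask = np.asarray(g_mask, dtype=bool)
    g_idx = np.flatnonzero(g_mask)
    r_idx = np.flatnonzero(~g_mask)
    pvals = np.full(X.shape[1], np.nan)
    for j in range(X.shape[1]):
        col = X[:, j]
        # Gather only the two 1-D vectors needed for this feature.
        a, b = col[g_idx], col[r_idx]
        a, b = a[~np.isnan(a)], b[~np.isnan(b)]
        if a.size and b.size:
            pvals[j] = mannwhitneyu(a, b, alternative=alternative).pvalue
    return pvals


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
g_mask = np.arange(20) < 10
X[g_mask, 0] += 5.0  # shift feature 0 in the group so it stands out
pvals = mwu_group_vs_rest(X, g_mask)
```

Only index arrays and one pair of 1-D vectors exist at any time, which is the memory saving the Notes refer to.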
- netflow.methods.stats._pair_chunk( ) Tuple[int, ndarray][source]#
Compute p-values for a chunk of feature columns for MWU or Wilcoxon.
- Parameters:
start_end (tuple(int, int)) – Half-open feature interval [start, end).
- Returns:
start (int) – Start feature index.
pvals_chunk (numpy.ndarray) – P-values for features in [start, end).
Notes
This is used only for tests that SciPy does not vectorize across features: MWU and Wilcoxon.
- netflow.methods.stats._pair_init_worker(
- X1: ndarray,
- X2: ndarray,
- *,
- test: Literal['MWU', 't-test'],
- alternative: str,
- nan_policy: str,
- test_kwargs: Dict[str, Any],
Initializer for parallel pairwise feature-wise workers (MWU/Wilcoxon).
- Parameters:
X1, X2 (numpy.ndarray) – Matrices shaped (n_samples1, n_features) and (n_samples2, n_features).
test ({"MWU","wilcoxon"}) – Feature-wise tests supported by this worker.
alternative ({"two-sided","less","greater"}) – Alternative hypothesis passed to SciPy.
nan_policy ({"omit","raise"}) – NaN handling.
test_kwargs (dict) – Extra keyword args forwarded to SciPy test functions.
- Return type:
None
- netflow.methods.stats.compute_pearson(
- row_x,
- row_y,
Compute Pearson correlation with each row in Y for a given row in X.
- Parameters:
row_x, row_y (array_like) – 1-D arrays, each representing multiple observations of a single variable. The correlation is computed between row_x and row_y.
- Returns:
correlation (float) – The correlation.
p_value (float) – The p-value.
- netflow.methods.stats.compute_pearson_parallel(
- X,
- Y,
- num_processors=None,
- chunksize=None,
Compute Pearson correlation between each row of X and all rows of Y in parallel.
- Parameters:
X, Y (pandas.DataFrame) – DataFrames containing multiple variables and observations. Each row represents a variable and each column is an observation of each variable. X and Y must have the same number of columns (i.e., the same observations) but they need not have the same number of variables.
num_processors (int) – Number of processors to use. Defaults to None (uses all available).
chunksize
- Returns:
correlations (dict) – The resulting correlations in the form {index_row_X: {index_row_Y: corr}}
p_values (dict) – The resulting p-values in the form {index_row_X: {index_row_Y: p_value}}
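To make the nested return structure concrete, the serial sketch below builds the same two dictionaries with `scipy.stats.pearsonr`. It does not use the library's parallel machinery, and the small DataFrames are invented purely for illustration.

```python
import pandas as pd
from scipy.stats import pearsonr

# Rows are variables, columns are observations (illustrative data).
X = pd.DataFrame([[1.0, 2.0, 3.0, 4.0]], index=["x1"])
Y = pd.DataFrame(
    [[2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0]], index=["y1", "y2"]
)

correlations, p_values = {}, {}
for ix, row_x in X.iterrows():
    correlations[ix], p_values[ix] = {}, {}
    for iy, row_y in Y.iterrows():
        r, p = pearsonr(row_x.to_numpy(), row_y.to_numpy())
        correlations[ix][iy] = float(r)
        p_values[ix][iy] = float(p)
```

A single value is then looked up as `correlations["x1"]["y2"]`, that is, by the row label of X first and the row label of Y second.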
- netflow.methods.stats.compute_spearman(
- row_x,
- row_y,
Compute Spearman correlation with each row in Y for a given row in X.
- Parameters:
row_x, row_y (array_like) – 1-D arrays, each representing multiple observations of a single variable. The correlation is computed between row_x and row_y.
- Returns:
correlation (float) – The correlation.
p_value (float) – The p-value.
- netflow.methods.stats.compute_spearman_parallel(
- X,
- Y,
- num_processors=None,
- chunksize=None,
Compute Spearman correlation between each row of X and all rows of Y in parallel.
- Parameters:
X, Y (pandas.DataFrame) – DataFrames containing multiple variables and observations. Each row represents a variable and each column is an observation of each variable. X and Y must have the same number of columns (i.e., the same observations) but they need not have the same number of variables.
num_processors (int) – Number of processors to use. Defaults to None (uses all available).
chunksize
- Returns:
correlations (dict) – The resulting correlations in the form {index_row_X: {index_row_Y: corr}}
p_values (dict) – The resulting p-values in the form {index_row_X: {index_row_Y: p_value}}
- netflow.methods.stats.mann_whitney_u_test(
- values1,
- values2,
- alternative='two-sided',
- **kwargs,
Perform the Mann-Whitney U rank test on two independent samples.
The Mann-Whitney U test is a nonparametric test of the null hypothesis that the distribution underlying values1 is the same as the distribution underlying values2. It is often used as a test of difference in location between distributions.
Computed via scipy.stats.mannwhitneyu.
- Parameters:
values1, values2 (array-like) – The sample arrays. They must have the same shape, except in the dimension corresponding to axis (the first, by default), which can be specified in kwargs.
alternative ({'two-sided', 'less', 'greater'}, optional) –
Defines the alternative hypothesis. The following options are available (default is 'two-sided'):
'two-sided': the distributions underlying the two samples are not equal.
'less': the distribution underlying values1 is stochastically less than the distribution underlying values2.
'greater': the distribution underlying values1 is stochastically greater than the distribution underlying values2.
kwargs (dict) – Keyword arguments passed to scipy.stats.mannwhitneyu.
- Returns:
p_value – The p-value.
- Return type:
float
- netflow.methods.stats.perform_stat_test(
- values1,
- values2,
- test_type: str,
- **kwargs,
Choose and perform a statistical test, matching your original dispatch behavior.
- Parameters:
values1, values2 (array-like) – Vectors of measurements. For "wilcoxon", these are paired vectors.
test_type ({"t-test","MWU","wilcoxon"}) – Which test to apply:
"t-test": two-sample independent t-test
"MWU": Mann–Whitney U test
"wilcoxon": Wilcoxon signed-rank test (paired)
**kwargs (dict) – Forwarded to the underlying SciPy test wrapper. In particular, you may pass alternative in kwargs for all supported tests.
- Returns:
p_value – P-value from the chosen test.
- Return type:
- Raises:
ValueError – If an invalid test_type is provided.
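A dispatch of this kind is often written as a lookup table. The sketch below maps each test_type directly to a SciPy function. `dispatch_test` is a hypothetical name, and the real function routes through the module's own wrappers (for example, its t-test wrapper defaults to Welch's test, which plain `ttest_ind` does not).

```python
from scipy.stats import mannwhitneyu, ttest_ind, wilcoxon

# Map each supported test_type to the SciPy function that computes it.
_TESTS = {"t-test": ttest_ind, "MWU": mannwhitneyu, "wilcoxon": wilcoxon}


def dispatch_test(values1, values2, test_type, **kwargs):
    """Run the requested test and return its p-value as a float."""
    try:
        func = _TESTS[test_type]
    except KeyError:
        raise ValueError(f"Unrecognized test_type: {test_type!r}") from None
    return float(func(values1, values2, **kwargs).pvalue)


p_value = dispatch_test(
    [1.2, 2.3, 3.1, 4.8],
    [2.9, 3.7, 5.5, 6.1, 7.0],
    "MWU",
    alternative="two-sided",
)
```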
- netflow.methods.stats.perform_stat_test_matrix(
- X1: DataFrame | ndarray,
- X2: DataFrame | ndarray,
- *,
- test: Literal['MWU', 't-test'] = 'MWU',
- alternative: Literal['two-sided', 'less', 'greater'] = 'two-sided',
- nan_policy: Literal['omit', 'raise'] = 'omit',
- equal_var: bool = False,
- n_jobs: int = 1,
- parallel_backend: Literal['processes', 'threads', 'none'] = 'processes',
- chunk_size_features: int = 256,
- test_kwargs: Dict[str, Any] | None = None,
Compute per-feature p-values comparing two groups of samples.
This function assumes inputs are samples x features and returns a vector of p-values of length n_features.
- Parameters:
X1, X2 (pandas.DataFrame or numpy.ndarray) – Two matrices of shape (n_samples1, n_features) and (n_samples2, n_features). If DataFrames are given, they are converted once to NumPy arrays.
test ({"MWU","t-test","wilcoxon"}, default="MWU") – Statistical test to perform:
"t-test": independent two-sample t-test via scipy.stats.ttest_ind with axis=0
"MWU": Mann–Whitney U via scipy.stats.mannwhitneyu (feature-wise loop/parallel)
"wilcoxon": Wilcoxon signed-rank via scipy.stats.wilcoxon (paired; requires same n_samples)
alternative ({"two-sided","less","greater"}, default="two-sided") – Alternative hypothesis passed to SciPy.
nan_policy ({"omit","raise"}, default="omit") – NaN handling:
"omit": omit NaNs feature-wise (t-test uses SciPy nan_policy; MWU/Wilcoxon omit manually)
"raise": raise if NaNs are present (t-test uses SciPy; MWU/Wilcoxon will yield NaNs or raise upstream)
equal_var (bool, default=False) – Only used for "t-test". False means Welch’s t-test.
n_jobs (int, default=1) – Number of workers for MWU/Wilcoxon feature loops. If 1, runs serially.
parallel_backend ({"processes","threads","none"}, default="processes") – Backend used for MWU/Wilcoxon parallel execution.
chunk_size_features (int, default=256) – Number of features per parallel chunk for MWU/Wilcoxon.
test_kwargs (dict, optional) – Extra keyword args forwarded to the underlying SciPy test:
MWU: forwarded to mannwhitneyu
Wilcoxon: forwarded to wilcoxon
t-test: forwarded to ttest_ind (in addition to alternative/equal_var/nan_policy)
- Returns:
pvals – Raw p-values per feature.
- Return type:
numpy.ndarray, shape (n_features,)
- Raises:
ValueError – If feature dimensions mismatch, or if wilcoxon is requested with mismatched sample counts.
Notes
Why SciPy t-test is enough here:
ttest_ind supports vectorization across features with axis=0, so there is no Python loop per feature.
Using SciPy directly keeps this implementation simple and robust.
Why MWU/Wilcoxon still loop:
SciPy does not vectorize these tests across features, so looping (and optional parallelism) is necessary if you want one p-value per feature.
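The two code paths contrasted in the Notes can be sketched side by side: one `ttest_ind` call with axis=0 for the t-test, and a per-feature loop for MWU. The random matrices are illustrative only, and the loop is serial rather than chunked or parallel.

```python
import numpy as np
from scipy.stats import mannwhitneyu, ttest_ind

rng = np.random.default_rng(0)
X1 = rng.normal(size=(12, 5))  # (n_samples1, n_features)
X2 = rng.normal(size=(15, 5))  # (n_samples2, n_features)

# t-test: one call yields a p-value per feature (Welch, since equal_var=False).
t_pvals = np.asarray(
    ttest_ind(X1, X2, axis=0, equal_var=False, nan_policy="omit").pvalue
)

# MWU: computed one feature column at a time.
mwu_pvals = np.array(
    [
        mannwhitneyu(X1[:, j], X2[:, j], alternative="two-sided").pvalue
        for j in range(X1.shape[1])
    ]
)
```

Both results have shape (n_features,), so downstream multiple-test correction can treat them identically.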
- netflow.methods.stats.sig_feats_by_group(
- groups: Series,
- feats: DataFrame,
- *,
- test: Literal['MWU', 't-test'] = 'MWU',
- alpha: float = 0.05,
- method: str = 'fdr_bh',
- min_group_size: int = 10,
- samples_axis: Literal['auto', 'index', 'columns'] = 'auto',
- alternative: Literal['two-sided', 'less', 'greater'] = 'two-sided',
- nan_policy: Literal['omit', 'raise'] = 'omit',
- equal_var: bool = False,
- n_jobs: int = 1,
- parallel_backend: Literal['processes', 'threads', 'none'] = 'processes',
- chunk_size_features: int = 256,
- test_kwargs: Dict[str, Any] | None = None,
- top_n: int | None = None,
- return_type: Literal['dict', 'wide'] = 'wide',
- add_effect_sizes: bool = True,
- log2fc_pseudocount: float = 1e-09,
Compute per-group feature significance vs the rest of the cohort.
For each group label g, compare samples in g vs all other samples, producing per-feature p-values and multiple-test-corrected p-values.
- Parameters:
groups (pandas.Series) – Group labels indexed by sample ID.
feats (pandas.DataFrame) – Feature matrix in either orientation:
samples x features (samples on index), or
features x samples (samples on columns)
samples_axis controls inference/forcing of orientation.
test ({"MWU","t-test","wilcoxon"}, default="MWU") – Statistical test to perform:
"MWU": Mann–Whitney U (unpaired, rank-based)
"t-test": independent two-sample t-test (Welch by default)
"wilcoxon": Wilcoxon signed-rank (paired)
IMPORTANT: "wilcoxon" is not a generic group-vs-rest test (paired design). This function will raise if test="wilcoxon".
alpha (float, default=0.05) – Error rate for multiple testing correction.
method (str, default="fdr_bh") –
Multiple test correction method.
bonferroni: one-step correction
sidak: one-step correction
holm-sidak: step down method using Sidak adjustments
holm: step-down method using Bonferroni adjustments
simes-hochberg: step-up method (independent)
hommel: closed method based on Simes tests (non-negative)
fdr_bh: Benjamini/Hochberg (non-negative)
fdr_by: Benjamini/Yekutieli (negative)
fdr_tsbh: two stage fdr correction (non-negative)
fdr_tsbky: two stage fdr correction (non-negative)
min_group_size (int, default=10) – Minimum number of samples required for a group to be tested. Must be >= 3.
samples_axis ({"auto","index","columns"}, default="auto") – Orientation control passed to _coerce_samples_x_features.
alternative ({"two-sided","less","greater"}, default="two-sided") – Alternative hypothesis.
nan_policy ({"omit","raise"}, default="omit") – NaN handling.
equal_var (bool, default=False) – Used for t-test only. False means Welch’s t-test.
n_jobs (int, default=1) – Workers for MWU feature-wise computation. If 1, runs serially.
parallel_backend ({"processes","threads","none"}, default="processes") – Backend for MWU parallelism.
chunk_size_features (int, default=256) – Feature chunk size for MWU parallelism.
test_kwargs (dict, optional) – Extra kwargs forwarded to SciPy test functions.
top_n (int, optional) – If provided, keep only the top_n features per group after sorting by corrected p-value then raw p-value.
return_type ({"wide", "dict"}, default="wide") – If "wide" (default), return a wide DataFrame with MultiIndex columns (group, metric). If "dict", return a dict mapping group -> per-group record DataFrame.
add_effect_sizes (bool, default=True) –
If True, include effect size summaries per feature:
n_in: number of samples within group (constant across features)
n_out: number of samples outside group (constant across features)
mean_in: mean feature value within group
mean_out: mean feature value outside group
mean_diff: mean_in - mean_out
log2fc: log2((mean_in + pseudocount) / (mean_out + pseudocount))
log2fc_pseudocount (float, default=1e-9) – Pseudocount added to means for log2 fold-change to avoid division by zero and log(0). Only used when add_effect_sizes=True.
- Returns:
records_full –
- If return_type="wide": DataFrame with index=features and columns MultiIndex (group, metric), where metric in {"p-value","corrected p-value"}.
- If return_type="dict": dict[group_label -> DataFrame(index=features, columns=["p-value","corrected p-value"])]. Each per-group DataFrame is sorted by corrected p-value then p-value.
- Return type:
dict or pandas.DataFrame
- Raises:
ValueError – If min_group_size < 3, the method is invalid, alignment fails, or test="wilcoxon".
Notes
t-test uses SciPy’s vectorized implementation across features via axis=0.
MWU is computed feature-wise (SciPy is not vectorized). This implementation avoids large submatrix allocations by gathering per-feature vectors via indices.
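The effect size summaries listed under add_effect_sizes follow directly from their definitions. The sketch below computes them for one group. `effect_sizes` is a hypothetical name, and the use of `np.nanmean` (ignoring NaNs in the means) is an assumption.

```python
import numpy as np


def effect_sizes(X, g_mask, pseudocount=1e-9):
    """Per-feature effect sizes for samples in the group vs all others."""
    g_mask = np.asarray(g_mask, dtype=bool)
    mean_in = np.nanmean(X[g_mask], axis=0)
    mean_out = np.nanmean(X[~g_mask], axis=0)
    return {
        "n_in": int(g_mask.sum()),
        "n_out": int((~g_mask).sum()),
        "mean_in": mean_in,
        "mean_out": mean_out,
        "mean_diff": mean_in - mean_out,
        "log2fc": np.log2((mean_in + pseudocount) / (mean_out + pseudocount)),
    }


# Four samples x two features; the first two samples form the group.
X = np.array([[4.0, 1.0], [4.0, 3.0], [1.0, 2.0], [1.0, 2.0]])
es = effect_sizes(X, [True, True, False, False], pseudocount=0.0)
# Feature 0: mean_in = 4, mean_out = 1, so log2fc = 2.
# Feature 1: both means are 2, so log2fc = 0.
```

The pseudocount only matters when a mean is zero or very close to it. Note that log2 fold-change is undefined for non-positive ratios, so it is most meaningful for non-negative features.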
- netflow.methods.stats.stat_test(
- df1: DataFrame,
- df2: DataFrame,
- test: Literal['MWU', 't-test'] = 'MWU',
- alpha: float = 0.05,
- method: str = 'fdr_bh',
- samples_axis: Literal['auto', 'index', 'columns'] = 'auto',
- alternative: Literal['two-sided', 'less', 'greater'] = 'two-sided',
- nan_policy: Literal['omit', 'raise'] = 'omit',
- equal_var: bool = False,
- n_jobs: int = 1,
- parallel_backend: Literal['processes', 'threads', 'none'] = 'processes',
- chunk_size_features: int = 256,
- test_kwargs: Dict[str, Any] | None = None,
Perform statistical test between groups/datasets and apply multiple test correction.
This compares df1 vs df2 feature-by-feature, returning raw and corrected p-values.
The statistical tests are computed via scipy.stats.
- Parameters:
df1, df2 (pandas.DataFrame) – The measurements to compare. Each DataFrame may be oriented as either:
samples x features (samples on index, features on columns)
features x samples (features on index, samples on columns)
The two DataFrames must contain the same features. The samples_axis argument controls whether orientation is inferred or enforced.
If test="wilcoxon" (paired), the sample dimension must also match and be aligned after coercion.
test ({"MWU","t-test","wilcoxon"}) –
The statistical test that should be performed. Options are:
'MWU' : Mann-Whitney U test (default).
't-test' : t-test
'wilcoxon' : Wilcoxon signed-rank test
alpha (float) – The family-wise error rate (FWER) passed to statsmodels multipletests, should be between 0 and 1.
method (str) –
Method for multiple test correction, default='fdr_bh'.
Options:
bonferroni : one-step correction
sidak : one-step correction
holm-sidak : step down method using Sidak adjustments
holm : step-down method using Bonferroni adjustments
simes-hochberg : step-up method (independent)
hommel : closed method based on Simes tests (non-negative)
fdr_bh : Benjamini/Hochberg (non-negative)
fdr_by : Benjamini/Yekutieli (negative)
fdr_tsbh : two stage fdr correction (non-negative)
fdr_tsbky : two stage fdr correction (non-negative)
samples_axis ({"auto","index","columns"}, default="auto") –
Orientation control:
"index": enforce samples on df.index (df is samples x features)
"columns": enforce samples on df.columns (df is features x samples), then transpose
"auto": infer orientation strictly from label agreement between df1 and df2 (see _coerce_two_dfs_samples_x_features for details)
alternative ({"two-sided","less","greater"}, default="two-sided") – Alternative hypothesis passed to the underlying SciPy test.
nan_policy ({"omit","raise"}, default="omit") – NaN handling. For t-test, this is passed to SciPy. For MWU/Wilcoxon, NaNs are handled feature-wise by the underlying implementation.
equal_var (bool, default=False) – Used for t-test only. False means Welch’s t-test.
n_jobs (int, default=1) – Workers for MWU/Wilcoxon (feature-wise). If 1, runs serially.
parallel_backend ({"processes","threads","none"}, default="processes") – Backend for MWU/Wilcoxon.
chunk_size_features (int, default=256) – Feature chunk size for MWU/Wilcoxon.
test_kwargs (dict) – Keyword arguments passed to scipy.stats for performing the statistical test.
- Returns:
record (pandas.DataFrame) – DataFrame indexed by feature name with columns:
"p-value"
"corrected p-value"
Notes
For unpaired tests ("MWU" and "t-test"), this wrapper routes to the optimized group-vs-rest engine _sig_feats_by_group_core by constructing a two-group grouping vector over the concatenated observations.
- MWU can be parallelized across feature chunks using additional kwargs:
n_jobs (int): number of workers (default 1)
parallel_backend ({"processes","threads","none"}): backend (default "processes")
chunk_size_features (int): features per chunk (default 256)
For "wilcoxon" (paired), a paired signed-rank test is computed between df1 and df2 columns, feature-wise.
- These keys (if present) are consumed by the wrapper and not forwarded to SciPy:
alternative : {"two-sided","less","greater"} (default "two-sided")
nan_policy : {"omit","raise"} (default "omit")
For t-test, this is passed to SciPy.
For MWU/wilcoxon, NaNs are omitted feature-wise when nan_policy="omit".
equal_var : bool (default False) for t-test (Welch vs pooled)
n_jobs : int (default 1) for MWU parallelism
parallel_backend : {"processes","threads","none"} (default "processes") for MWU
chunk_size_features : int (default 256) for MWU
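The "two-group grouping vector over the concatenated observations" mentioned in the Notes can be pictured with the sketch below. The labels "df1" and "df2" and the variable names are hypothetical; the internal engine may label and store the groups differently.

```python
import numpy as np
import pandas as pd

# Two samples x features matrices with the same features (illustrative data).
X1 = pd.DataFrame(np.ones((3, 2)), index=["a1", "a2", "a3"], columns=["f1", "f2"])
X2 = pd.DataFrame(
    np.zeros((4, 2)), index=["b1", "b2", "b3", "b4"], columns=["f1", "f2"]
)

# Stack the observations and record which input each row came from.
X = pd.concat([X1, X2], axis=0)
groups = pd.Series(["df1"] * len(X1) + ["df2"] * len(X2), index=X.index)

# With only two labels, "df1 vs rest" is exactly "df1 vs df2".
g_mask = (groups == "df1").to_numpy()
```

This is why an unpaired two-dataset comparison can reuse a group-vs-rest engine without a separate code path.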
- netflow.methods.stats.t_test(
- values1,
- values2,
- alternative='two-sided',
- equal_var=False,
- **kwargs,
Calculate the T-test for the means of two independent samples of scores.
This is a test for the null hypothesis that 2 independent samples have identical average (expected) values. By default (equal_var=False), this wrapper performs Welch’s t-test, which does not assume equal population variances.
Computed via scipy.stats.ttest_ind.
- Parameters:
values1, values2 (array-like) – The sample arrays. They must have the same shape, except in the dimension corresponding to axis (the first, by default), which can be specified in kwargs.
alternative ({'two-sided', 'less', 'greater'}, optional) –
Defines the alternative hypothesis. The following options are available (default is 'two-sided'):
'two-sided': the means of the distributions underlying the samples are unequal.
'less': the mean of the distribution underlying the first sample is less than the mean of the distribution underlying the second sample.
'greater': the mean of the distribution underlying the first sample is greater than the mean of the distribution underlying the second sample.
equal_var (bool, default=False) – Passed to scipy.stats.ttest_ind; default False corresponds to Welch’s t-test. If True, performs the standard independent 2 sample test that assumes equal population variances.
kwargs (dict) – Keyword arguments passed to scipy.stats.ttest_ind.
- Returns:
p_value – The p-value.
- Return type:
float
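To see what the equal_var switch changes, the sketch below calls `scipy.stats.ttest_ind` directly on two invented samples with clearly different spreads. It illustrates the underlying SciPy behavior rather than the wrapper itself.

```python
import numpy as np
from scipy.stats import ttest_ind

# Illustrative samples: similar means, very different variances and sizes.
a = np.array([5.1, 4.9, 5.0, 5.2, 4.8])
b = np.array([4.0, 6.5, 3.2, 7.1, 5.9, 2.8, 6.6])

# Welch's t-test (the default documented above): no equal-variance assumption.
p_welch = ttest_ind(a, b, equal_var=False).pvalue

# Pooled-variance t-test: assumes both populations share one variance.
p_pooled = ttest_ind(a, b, equal_var=True).pvalue
```

The two p-values differ because the standard error and degrees of freedom are computed differently. When variances and sample sizes are unequal, Welch's version is usually the safer choice.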
- netflow.methods.stats.wilcoxon_signed_rank_test(
- values1,
- values2=None,
- alternative='two-sided',
- **kwargs,
The Wilcoxon signed-rank test.
The Wilcoxon signed-rank test tests the null hypothesis that two related paired samples come from the same distribution. In particular, it tests whether the distribution of the differences values1 - values2 is symmetric about zero. It is a non-parametric version of the paired t-test.
Computed via scipy.stats.wilcoxon.
- Parameters:
values1 (array-like) – Either the first set of measurements (in which case values2 is the second set of measurements), or the differences between two sets of measurements (in which case values2 is not to be specified). Must be one-dimensional.
values2 (array-like, optional) – Either the second set of measurements (if values1 is the first set of measurements), or not specified (if values1 is the differences between two sets of measurements). Must be one-dimensional.
alternative ({'two-sided', 'less', 'greater'}, optional) –
Defines the alternative hypothesis, where d denotes the paired differences values1 - values2 (or values1 itself when values2 is omitted). The following options are available (default is 'two-sided'):
'two-sided': the distribution underlying d is not symmetric about zero.
'less': the distribution underlying d is stochastically less than a distribution symmetric about zero.
'greater': the distribution underlying d is stochastically greater than a distribution symmetric about zero.
kwargs (dict) – Keyword arguments passed to scipy.stats.wilcoxon.
- Returns:
p_value – The p-value.
- Return type:
float
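The two calling conventions described for values1 and values2 can be illustrated with `scipy.stats.wilcoxon` directly. The measurements are invented for the example, and the variable names are not part of the library.

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired measurements on the same eight subjects (illustrative data).
before = np.array([10.1, 9.8, 11.2, 10.5, 9.9, 10.8, 11.0, 10.3])
after = np.array([10.9, 10.1, 11.0, 11.4, 10.6, 11.9, 11.4, 10.2])

# Convention 1: pass both sets of measurements.
p_paired = wilcoxon(before, after, alternative="two-sided").pvalue

# Convention 2: pass the differences as a single argument.
p_diff = wilcoxon(before - after, alternative="two-sided").pvalue
```

Both calls test the same differences, so they yield the same p-value. The order of the pairs matters, which is why the samples must be aligned before the test is applied.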