netflow.methods.stats

Functions

compute_pearson(row_x, row_y)

Compute Pearson correlation with each row in Y for a given row in X.

compute_pearson_parallel(X, Y[, ...])

Compute Pearson correlation between each row of X and all rows of Y in parallel.

compute_spearman(row_x, row_y)

Compute Spearman correlation with each row in Y for a given row in X.

compute_spearman_parallel(X, Y[, ...])

Compute Spearman correlation between each row of X and all rows of Y in parallel.

mann_whitney_u_test(values1, values2[, ...])

Perform the Mann-Whitney U rank test on two independent samples.

perform_stat_test(values1, values2, ...)

stat_test(df1, df2[, test, alpha, method])

Perform a statistical test between datasets with multiple-testing correction.

t_test(values1, values2[, alternative])

Calculate the T-test for the means of two independent samples of scores.

wilcoxon_signed_rank_test(values1[, ...])

The Wilcoxon signed-rank test.

netflow.methods.stats.compute_pearson(row_x, row_y)

Compute Pearson correlation with each row in Y for a given row in X.

Parameters:
  • row_x (array_like) – 1-D array holding multiple observations of a single variable. The correlation is computed between row_x and row_y.

  • row_y (array_like) – 1-D array holding multiple observations of a single variable, with the same length as row_x.

Returns:

  • correlation (float) – The correlation.

  • p_value (float) – The p-value.
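As a rough illustration of the (correlation, p_value) pair described above, the sketch below calls scipy.stats.pearsonr directly. Using scipy as a stand-in is an assumption: this page does not state how compute_pearson is implemented, and the sample values are made up for the example.

```python
import numpy as np
from scipy import stats

# Two 1-D arrays: observations of one variable each (illustrative values).
row_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
row_y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Stand-in for compute_pearson(row_x, row_y); assumed to behave like pearsonr.
correlation, p_value = stats.pearsonr(row_x, row_y)

print(f"correlation={correlation:.4f}, p_value={p_value:.2e}")
```

Both arrays must have the same length, since each position is one shared observation.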

netflow.methods.stats.compute_pearson_parallel(X, Y, num_processors=None, chunksize=None)

Compute Pearson correlation between each row of X and all rows of Y in parallel.

Parameters:
  • X (pandas.DataFrame) – Dataframe containing multiple variables and observations. Each row represents a variable and each column is an observation of each variable. X and Y must have the same number of columns (i.e., the same observations) but they need not have the same number of variables.

  • Y (pandas.DataFrame) – Dataframe with the same layout as X: rows are variables and columns are observations. It must have the same columns as X, but may have a different number of rows.

  • num_processors (int) – Number of processors to use. Defaults to None (uses all available).

Returns:

  • correlations (dict) – The resulting correlations in the form {index_row_X: {index_row_Y: corr}}

  • p_values (dict) – The resulting p_values in the form {index_row_X: {index_row_Y: p_value}}
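The nested-dictionary return format can be pictured with a serial loop. This is only a sketch of the output shape {index_row_X: {index_row_Y: value}}; it does not reproduce the parallel execution, and it assumes scipy.stats.pearsonr as the per-pair computation. The frames and their labels are hypothetical.

```python
import pandas as pd
from scipy import stats

samples = ["s1", "s2", "s3", "s4"]
# Rows are variables, columns are observations (hypothetical data).
X = pd.DataFrame([[1.0, 2.0, 3.0, 4.0],
                  [4.0, 1.0, 3.0, 2.0]], index=["x1", "x2"], columns=samples)
Y = pd.DataFrame([[2.0, 4.1, 5.9, 8.2],
                  [8.0, 6.1, 3.9, 2.0],
                  [1.0, 3.0, 2.0, 5.0]], index=["y1", "y2", "y3"], columns=samples)

correlations, p_values = {}, {}
for ix, row_x in X.iterrows():
    correlations[ix], p_values[ix] = {}, {}
    for iy, row_y in Y.iterrows():
        # One correlation per (row of X, row of Y) pair.
        r, p = stats.pearsonr(row_x.to_numpy(), row_y.to_numpy())
        correlations[ix][iy] = float(r)
        p_values[ix][iy] = float(p)

# Nested dicts convert to a table with X rows as columns; transpose for X-by-Y.
print(pd.DataFrame(correlations).T)
```

X has two variables and Y has three, so each result holds 2 x 3 entries; only the number of columns has to match.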

netflow.methods.stats.compute_spearman(row_x, row_y)

Compute Spearman correlation with each row in Y for a given row in X.

Parameters:
  • row_x (array_like) – 1-D array holding multiple observations of a single variable. The correlation is computed between row_x and row_y.

  • row_y (array_like) – 1-D array holding multiple observations of a single variable, with the same length as row_x.

Returns:

  • correlation (float) – The correlation.

  • p_value (float) – The p-value.
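To show how a rank-based correlation differs from a linear one, the sketch below compares scipy.stats.spearmanr and scipy.stats.pearsonr on a monotone but nonlinear relationship. Treating scipy as the backend of compute_spearman is an assumption, and the data are invented for the example.

```python
import numpy as np
from scipy import stats

row_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
row_y = np.exp(row_x)  # strictly increasing, but far from linear

# Stand-in for compute_spearman(row_x, row_y); assumed to behave like spearmanr.
rho, p_value = stats.spearmanr(row_x, row_y)

# Pearson on the same data, for comparison only.
r, _ = stats.pearsonr(row_x, row_y)

print(f"spearman={rho:.3f}, pearson={r:.3f}")
```

Because Spearman works on ranks, any strictly increasing relationship gives a coefficient of 1, while Pearson is penalized by the curvature.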

netflow.methods.stats.compute_spearman_parallel(X, Y, num_processors=None, chunksize=None)

Compute Spearman correlation between each row of X and all rows of Y in parallel.

Parameters:
  • X (pandas.DataFrame) – Dataframe containing multiple variables and observations. Each row represents a variable and each column is an observation of each variable. X and Y must have the same number of columns (i.e., the same observations) but they need not have the same number of variables.

  • Y (pandas.DataFrame) – Dataframe with the same layout as X: rows are variables and columns are observations. It must have the same columns as X, but may have a different number of rows.

  • num_processors (int) – Number of processors to use. Defaults to None (uses all available).

Returns:

  • correlations (dict) – The resulting correlations in the form {index_row_X: {index_row_Y: corr}}

  • p_values (dict) – The resulting p_values in the form {index_row_X: {index_row_Y: p_value}}

netflow.methods.stats.mann_whitney_u_test(values1, values2, alternative='two-sided', **kwargs)

Perform the Mann-Whitney U rank test on two independent samples.

The Mann-Whitney U test is a nonparametric test of the null hypothesis that the distribution underlying values1 is the same as the distribution underlying values2. It is often used as a test of difference in location between distributions.

Computed via scipy.stats.mannwhitneyu.

Parameters:
  • values1 (array-like) – The arrays must have the same shape, except in the dimension corresponding to axis (the first, by default), which can be specified in kwargs.

  • values2 (array-like) – The arrays must have the same shape, except in the dimension corresponding to axis (the first, by default), which can be specified in kwargs.

  • alternative ({'two-sided', 'less', 'greater'}, optional) –

    Defines the alternative hypothesis. The following options are available (default is 'two-sided'):

    • 'two-sided': the distributions underlying the samples are not equal.

    • 'less': the distribution underlying the first sample is stochastically less than the distribution underlying the second sample.

    • 'greater': the distribution underlying the first sample is stochastically greater than the distribution underlying the second sample.

  • kwargs (dict) – Keyword arguments passed to scipy.stats.mannwhitneyu.

Returns:

p_value – The p-value.

Return type:

float
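The effect of the alternative argument can be seen by calling scipy.stats.mannwhitneyu directly, which this page names as the underlying routine. The sketch does not call the netflow wrapper itself, and the two samples are illustrative values chosen so that values1 lies entirely below values2.

```python
import numpy as np
from scipy import stats

values1 = np.array([1.1, 2.3, 1.9, 2.8, 1.5, 2.0])
values2 = np.array([3.9, 4.4, 5.1, 3.7, 4.8, 4.2])

# One p-value per alternative hypothesis.
p = {
    alt: stats.mannwhitneyu(values1, values2, alternative=alt).pvalue
    for alt in ("two-sided", "less", "greater")
}

for alt, value in p.items():
    print(f"{alt:>9}: p = {value:.4f}")
```

With values1 shifted below values2, 'less' gives the smallest p-value and 'greater' gives a p-value close to 1.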

netflow.methods.stats.perform_stat_test(values1, values2, test_type, **kwargs)

netflow.methods.stats.stat_test(df1, df2, test='MWU', alpha=0.05, method='fdr_bh', **kwargs)

Perform a statistical test between datasets with multiple-testing correction.

The statistical tests are computed via scipy.stats.

Parameters:
  • df1 (pandas.DataFrame) – The measurements, where rows are features and columns are observations. The dataframes must have the same number of features (rows). If test='wilcoxon', they must also have the same number of observations (columns).

  • df2 (pandas.DataFrame) – The measurements to compare against df1, in the same layout (rows are features, columns are observations) and subject to the same shape requirements.

  • test (str) –

    The statistical test that should be performed. Options are:

    • 'MWU' : Mann-Whitney U test (default)

    • 't-test' : T-test

    • 'wilcoxon' : Wilcoxon signed-rank test

  • alpha (float) – The target error rate for the correction (family-wise error rate, or false discovery rate for the fdr_* methods). Should be between 0 and 1.

  • method (str) –

    Method for multiple-test correction, default='fdr_bh'.

    Options:

    • bonferroni : one-step correction

    • sidak : one-step correction

    • holm-sidak : step down method using Sidak adjustments

    • holm : step-down method using Bonferroni adjustments

    • simes-hochberg : step-up method (independent)

    • hommel : closed method based on Simes tests (non-negative)

    • fdr_bh : Benjamini/Hochberg (non-negative)

    • fdr_by : Benjamini/Yekutieli (negative)

    • fdr_tsbh : two stage fdr correction (non-negative)

    • fdr_tsbky : two stage fdr correction (non-negative)

  • kwargs (dict) – Keyword arguments passed to the scipy.stats function that performs the statistical test.

Returns:

record – Record of each feature, p-value, and corrected p-value.

Return type:

pandas.DataFrame
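The overall workflow (one test per feature, then a correction across features) can be sketched as follows. This is not the netflow implementation: bh_adjust is a hypothetical hand-written helper for the Benjamini-Hochberg ('fdr_bh') adjustment, the column names of record are assumptions, and the data are synthetic.

```python
import numpy as np
import pandas as pd
from scipy import stats


def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values (hypothetical helper)."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by n / i.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value downwards, cap at 1.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted


features = ["f1", "f2", "f3"]
# Rows are features, columns are observations; column counts may differ.
df1 = pd.DataFrame([np.linspace(0.0, 1.0, 8)] * 3, index=features)
df2 = pd.DataFrame([np.linspace(5.0, 6.0, 10),     # f1: clearly shifted
                    np.linspace(0.05, 1.05, 10),   # f2: nearly identical
                    np.linspace(-0.03, 0.97, 10)], # f3: nearly identical
                   index=features)

# One Mann-Whitney U test per feature (row).
p_raw = [
    stats.mannwhitneyu(df1.loc[f].to_numpy(), df2.loc[f].to_numpy(),
                       alternative="two-sided").pvalue
    for f in features
]

record = pd.DataFrame({
    "feature": features,
    "p_value": p_raw,
    "p_value_corrected": bh_adjust(p_raw),
})
record["reject"] = record["p_value_corrected"] < 0.05
print(record)
```

Corrected p-values are never smaller than the raw ones, so a feature that passes the threshold after correction also passes it before.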

netflow.methods.stats.t_test(values1, values2, alternative='two-sided', **kwargs)

Calculate the T-test for the means of two independent samples of scores.

This is a test for the null hypothesis that 2 independent samples have identical average (expected) values. This test assumes that the populations have identical variances by default.

Computed via scipy.stats.ttest_ind.

Parameters:
  • values1 (array-like) – The arrays must have the same shape, except in the dimension corresponding to axis (the first, by default), which can be specified in kwargs.

  • values2 (array-like) – The arrays must have the same shape, except in the dimension corresponding to axis (the first, by default), which can be specified in kwargs.

  • alternative ({'two-sided', 'less', 'greater'}, optional) –

    Defines the alternative hypothesis. The following options are available (default is 'two-sided'):

    • 'two-sided': the means of the distributions underlying the samples are unequal.

    • 'less': the mean of the distribution underlying the first sample is less than the mean of the distribution underlying the second sample.

    • 'greater': the mean of the distribution underlying the first sample is greater than the mean of the distribution underlying the second sample.

  • kwargs (dict) – Keyword arguments passed to scipy.stats.ttest_ind.

Returns:

p_value – The p-value.

Return type:

float
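The equal-variance default mentioned above can be relaxed through a keyword argument of scipy.stats.ttest_ind, the routine this page names as the backend. The sketch below calls scipy directly rather than the netflow wrapper; that equal_var would be forwarded unchanged through kwargs is an assumption, and the samples are illustrative.

```python
import numpy as np
from scipy import stats

values1 = np.array([5.1, 4.9, 5.6, 5.8, 5.0, 5.3])
values2 = np.array([6.2, 6.8, 5.9, 7.1, 6.5, 6.0])

# Default: pooled-variance (Student) t-test.
p_pooled = stats.ttest_ind(values1, values2).pvalue

# Welch's t-test: does not assume equal population variances.
p_welch = stats.ttest_ind(values1, values2, equal_var=False).pvalue

# One-sided test: is the mean of values1 less than the mean of values2?
p_less = stats.ttest_ind(values1, values2, alternative="less").pvalue

print(f"pooled={p_pooled:.5f}, welch={p_welch:.5f}, less={p_less:.5f}")
```

When the observed difference points in the tested direction, the one-sided p-value is half the two-sided one.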

netflow.methods.stats.wilcoxon_signed_rank_test(values1, values2=None, alternative='two-sided', **kwargs)

The Wilcoxon signed-rank test.

The Wilcoxon signed-rank test tests the null hypothesis that two related paired samples come from the same distribution. In particular, it tests whether the distribution of the differences values1 - values2 is symmetric about zero. It is a non-parametric version of the paired T-test.

Computed via scipy.stats.wilcoxon.

Parameters:
  • values1 (array-like) – Either the first set of measurements (in which case values2 is the second set of measurements), or the differences between two sets of measurements (in which case values2 is not to be specified). Must be one-dimensional.

  • values2 (array-like) – Optional. Either the second set of measurements (if values1 is the first set of measurements), or not specified (if values1 is the differences between two sets of measurements). Must be one-dimensional.

  • alternative ({'two-sided', 'less', 'greater'}, optional) –

    Defines the alternative hypothesis. In the following, d is the difference between the paired samples: d = values1 - values2 if both are provided, or d = values1 otherwise. The following options are available (default is 'two-sided'):

    • 'two-sided': the distribution underlying d is not symmetric about zero.

    • 'less': the distribution underlying d is stochastically less than a distribution symmetric about zero.

    • 'greater': the distribution underlying d is stochastically greater than a distribution symmetric about zero.

  • kwargs (dict) – Keyword arguments passed to scipy.stats.wilcoxon.

Returns:

p_value – The p-value.

Return type:

float
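The two calling conventions described for values1 and values2 (paired samples, or precomputed differences) give the same result. The sketch below demonstrates this with scipy.stats.wilcoxon, the routine this page names as the backend, rather than the netflow wrapper; the paired measurements are invented for illustration.

```python
import numpy as np
from scipy import stats

# Paired measurements on the same eight subjects (illustrative values).
before = np.array([125.0, 130.0, 118.0, 140.0, 136.0, 128.0, 122.0, 133.0])
after = np.array([120.5, 126.0, 119.0, 131.0, 130.5, 125.5, 120.0, 126.0])

# Form 1: pass both samples; the test works on before - after.
p_paired = stats.wilcoxon(before, after).pvalue

# Form 2: pass the differences only and leave the second argument out.
p_diff = stats.wilcoxon(before - after).pvalue

print(f"paired={p_paired:.4f}, differences={p_diff:.4f}")
```

Both samples must be one-dimensional and of equal length, since each position is one matched pair.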