netflow.methods.classes#

Classes to compute distances.

Classes

InfoNet(keeper, graph_key, layer[, verbose])

A class to compute information flow on a network and correlation between network modules

class netflow.methods.classes.InfoNet(keeper, graph_key, layer, verbose=None)[source]#

A class to compute information flow on a network and correlation between network modules

Parameters:
  • keeper (netflow.Keeper) – The keeper object that stores the data of size (n_features, n_observations).

  • graph_key (str) – The key to the graph in the graph keeper that should be used. The graph does not have to include all features in the data.

  • layer (str) – The key to the data in the data keeper that should be used.

__init__(keeper, graph_key, layer, verbose=None)[source]#
__module__ = 'netflow.methods.classes'#
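
Example (a minimal, hypothetical sketch, not taken from the package documentation; it assumes a netflow.Keeper named keeper that already stores a data layer under the key "expr" and a graph under the key "ppi" — both key names are illustrative):

>>> from netflow.methods.classes import InfoNet
>>> # keeper: a netflow.Keeper holding data of shape (n_features, n_observations)
>>> # and a graph registered under the key "ppi" (construction omitted here)
>>> inet = InfoNet(keeper, graph_key="ppi", layer="expr")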
anisotropic_laplacian_matrix(observation=None, use_spectral_gap=False, data=None)[source]#

Return the transpose of the anisotropic random-walk Laplacian matrix.

Parameters:
  • observation ({None, str}) – If provided, use the observation-weighted graph to construct the Laplacian. If None, the Laplacian is constructed from the unweighted graph, treating all edge weights as 1.

  • use_spectral_gap (bool) – Option to use spectral gap.

  • data ({None, str, pandas.DataFrame}) –

    Specify which data the observation profile should be taken from. Note, this is ignored if observation is None.

    data options:

    • None : The original data is used.

    • str : data is expected to be a key in self.meta and the observation profile is taken from the data in the corresponding dict-value.

    • pandas.DataFrame : The dataframe is used to select the observation profile, where columns are observations and the rows are expected to be labelled by the node indices, \(0, 1, ..., n-1\) where \(n\) is the number of nodes in the graph.

Returns:

Lrw – Transpose of the anisotropic random-walk graph Laplacian matrix.
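
Example (an illustrative sketch continuing the hypothetical inet object from the constructor example; "sample_1" is an assumed observation name):

>>> # Laplacian built from the observation-weighted graph
>>> Lrw = inet.anisotropic_laplacian_matrix(observation="sample_1")
>>> # Laplacian built from the unweighted graph (uniform weights of 1)
>>> Lrw_uniform = inet.anisotropic_laplacian_matrix(observation=None)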

compute_graph_distances(G=None, weight='weight')[source]#

Compute graph distances between all pairs of nodes.

Parameters:
  • G (networkx.Graph) – The graph with nodes assumed to be labeled consecutively from \(0, 1, ..., n-1\) where \(n\) is the number of nodes.

  • weight (str, optional) – Edge attribute of weights used for computing the weighted hop distance. If None, compute the unweighted distance. That is, rather than minimizing the sum of weights over all paths between every two nodes, minimize the number of edges.

Returns:

dist – A matrix of node-pairwise graph distances between the \(n\) nodes.

Return type:

numpy.ndarray, (n, n)
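
Example (illustrative sketch; leaving G as None is assumed to fall back to the graph selected by graph_key):

>>> # weighted shortest-path distances using the "weight" edge attribute
>>> dist = inet.compute_graph_distances(weight="weight")
>>> # unweighted hop distances (minimize the number of edges on each path)
>>> hop_dist = inet.compute_graph_distances(weight=None)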

diffuse_multiple_profiles(observations=None, times=None, t_min=-1.5, t_max=2.5, n_t=10, log_time=True, laplacian=None, use_spectral_gap=False, do_plot=False, **plot_kwargs)[source]#

Diffuse multiple observation profiles from the original data.

Todo: get Laplacian for observation or hop distance for all.

Parameters:
  • observations ({None, list}) – Observations to iterate over. If None, use all observations in data.

  • times ({None, array}) – Array of times to evaluate the diffusion simulation. Note, If given, t_min, t_max and n_t are ignored.

  • t_min (float) – First time point to evaluate the diffusion simulation. Note, t_min is ignored if times is not None.

  • t_max (float) – Last time point to evaluate the diffusion simulation. Note, t_max must be greater than t_min, i.e, t_max > t_min. Note, t_max is ignored if times is not None.

  • n_t (int) – Number of time points to generate. Note, n_t is ignored if times is not None.

  • log_time (bool) – If True, return n_t numbers spaced evenly on a log scale, where the time sequence starts at 10 ** t_min, ends with 10 ** t_max, and the sequence of times is of the form 10 ** t where t is the n_t evenly spaced points between (and including) t_min and t_max. For example, _get_times(t_min=1, t_max=3, n_t=3, log_time=True) = array([10 ** 1, 10 ** 2, 10 ** 3]) = array([10., 100., 1000.]). If False, return n_t numbers evenly spaced on a linear scale, where the sequence starts at t_min and ends with t_max. For example, _get_times(t_min=1, t_max=3, n_t=3, log_time=False) = array([1., 2., 3.]).

  • graph_distances (numpy.ndarray, (n, n)) – A matrix of node-pairwise graph distances between the \(n\) nodes ordered by the rows in object.profiles.

  • laplacian (numpy.ndarray, (n, n)) – The transpose of the graph Laplacian matrix where the \(n\) rows and columns ordered by the rows in object.profiles. If None, the observation-specific Laplacian is used.

  • use_spectral_gap (bool) – Option to use spectral gap. Note, This is ignored if laplacian is provided and not None.

  • filename (str) – If not None, save results.

  • do_plot (bool) – If True, plot diffused profiles for each observation.

  • **plot_kwargs (dict) – Key-word arguments passed to plot_profiles (should not include title).

Notes

Side effects:

  • Saves a pandas DataFrame of the diffused profiles, where each row is a time and each column is a feature name, for each observation to the file self.outdir / ‘diffused_profiles’ / ‘diffused_profile_{observation}.csv’.

  • If do_plot is True, plots the diffused profile for each observation.
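
Example (illustrative sketch using the hypothetical inet object; the explicit times in the second call are arbitrary):

>>> # diffuse every observation over 10 log-spaced times from 10 ** -1.5 to 10 ** 2.5
>>> inet.diffuse_multiple_profiles()
>>> # or pass explicit times, in which case t_min, t_max and n_t are ignored
>>> import numpy as np
>>> inet.diffuse_multiple_profiles(times=np.logspace(-1, 2, 5), do_plot=True)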

diffuse_profile(observation, times=None, t_min=-1.5, t_max=2.5, n_t=10, log_time=True, laplacian=None, do_save=True)[source]#

Diffuse an observation profile from the original data.

Todo: get Laplacian for observation or hop distance for all.

Parameters:
  • observation (str) – Observation profile to use.

  • times ({None, array}) – Array of times to evaluate the diffusion simulation. Note, If given, t_min, t_max and n_t are ignored.

  • t_min (float) – First time point to evaluate the diffusion simulation. Note, t_min is ignored if times is not None.

  • t_max (float) – Last time point to evaluate the diffusion simulation. Note, t_max must be greater than t_min, i.e, t_max > t_min. Note, t_max is ignored if times is not None.

  • n_t (int) – Number of time points to generate. Note, n_t is ignored if times is not None.

  • log_time (bool) – If True, return n_t numbers spaced evenly on a log scale, where the time sequence starts at 10 ** t_min, ends with 10 ** t_max, and the sequence of times is of the form 10 ** t where t is the n_t evenly spaced points between (and including) t_min and t_max. For example, _get_times(t_min=1, t_max=3, n_t=3, log_time=True) = array([10 ** 1, 10 ** 2, 10 ** 3]) = array([10., 100., 1000.]). If False, return n_t numbers evenly spaced on a linear scale, where the sequence starts at t_min and ends with t_max. For example, _get_times(t_min=1, t_max=3, n_t=3, log_time=False) = array([1., 2., 3.]).

  • graph_distances (numpy.ndarray, (n, n)) – A matrix of node-pairwise graph distances between the \(n\) nodes ordered by the rows in object.profiles.

  • laplacian (numpy.ndarray, (n, n)) – The transpose of the graph Laplacian matrix where the \(n\) rows and columns ordered by the rows in object.profiles.

  • filename (str) – If not None, save results.

  • do_save (bool) – If True, save diffused profile to self.outdir / ‘diffused_profiles’ / ‘diffused_profile_{observation}.csv’

Returns:

profiles – Diffused profiles where each row is a time and each column is a feature name.

Return type:

pandas.DataFrame
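
Example (illustrative sketch; "sample_1" is an assumed observation name):

>>> profiles = inet.diffuse_profile("sample_1", t_min=-1, t_max=2, n_t=8, do_save=False)
>>> profiles.index    # the 8 diffusion times
>>> profiles.columns  # feature names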

dispersion(data, axis=0)[source]#

Data dispersion computed as the absolute value of the variance-to-mean ratio, where the variance and mean are computed on the values over the requested axis.

Parameters:
  • data (pandas.DataFrame) – Data used to compute dispersion.

  • axis ({0, 1}) –

    Axis along which the variance and mean are computed.

    Options:

    • 0 : for each column, apply function to the values over the index

    • 1 : for each index, apply function to the values over the columns

Returns:

vmr – Variance-to-mean ratio (vmr) quantifying the dispersion.

Return type:

pandas.Series
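
Example (a small self-contained sketch of the variance-to-mean ratio; only pandas is required beyond the hypothetical inet object):

>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [2.0, 2.0, 2.0]})
>>> vmr = inet.dispersion(df, axis=0)   # one value per column
>>> # column "b" is constant, so its variance-to-mean ratio is 0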

invariant_measures(label='IM')[source]#

Compute the invariant measure for each observation using data as node weights.

Parameters:

label (str) – Label of key used to store invariant measures in the data keeper.

Returns:

Makes the invariant measures attribute of size (n_observations, n_observations) available in keeper.data[label].

kendall_tau(data, **kwargs)[source]#

Calculate Kendall’s tau correlation coefficient with associated p-value using scipy.stats.kendalltau.

Parameters:
  • data (numpy.ndarray, (n_observations, n_features)) – 2-D array containing multiple variables and observations.

  • **kwargs (dict) – Optional key-word arguments passed to scipy.stats.kendalltau.

Returns:

  • R (pandas.DataFrame) – Kendall’s tau correlation matrix. The correlation matrix is square with length equal to the total number of variables (columns or rows).

  • pvalue (float) – The p-value for a hypothesis test whose null hypothesis is that two sets of data are uncorrelated. See the documentation for scipy.stats.kendalltau for alternative hypotheses. pvalue has the same shape as R.
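
Example (illustrative sketch; it assumes the two documented outputs are returned together as a tuple):

>>> import numpy as np
>>> X = np.random.default_rng(0).normal(size=(20, 4))   # (n_observations, n_features)
>>> R, pvalue = inet.kendall_tau(X)                     # assumes an (R, pvalue) tuple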

load_diffused_profile(observation)[source]#

Load the diffused profile for an observation.

Parameters:

observation (str) – Observation profile to load.

Returns:

diffused_profiles – Diffused profiles where each row is a time and each column is a feature name.

Return type:

pandas.DataFrame

load_diffused_timepoint_profile(time, observations=None)[source]#

Load the diffused profiles of the selected observations at a given timepoint.

Parameters:
  • time (float) – Timepoint in diffusion simulation to select.

  • observations ({None, list}) – List of observations to iterate over. If None, all observations in self.data are used.

Returns:

diffused_profiles – Diffused profiles at time, for all observations, where rows are nodes and columns are observations.

Return type:

pandas.DataFrame
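
Example (illustrative sketch; it assumes diffused profiles were previously written by diffuse_multiple_profiles and that the observation "sample_1" and the timepoint 1.0 exist):

>>> # full time course for one observation
>>> prof = inet.load_diffused_profile("sample_1")
>>> # one diffusion time across all observations (rows are nodes, columns are observations)
>>> snapshot = inet.load_diffused_timepoint_profile(time=1.0)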

multiple_pairwise_observation_neighborhood_euc_distance(nodes=None, include_self=False, label='pw_obs_nbhd_euc_dist', desc='Computing pairwise 1-hop distances', profiles_desc='t0', metric='euclidean', normalize=False, **kwargs)[source]#

Compute observation-pairwise Euclidean distances between the profiles over node neighborhood for all nodes.

Parameters:
  • nodes ({None, list, [int]}) – List of nodes to compute neighborhood distances on. If None, all nodes with at least 2 neighbors are used.

  • include_self (bool) – If True, add node in neighborhood which will result in computing normalized profile over the neighborhood. If False, node is not included in neighborhood which results in computing the transition distribution over the neighborhood.

  • label (str) – Label that resulting Euclidean distances are saved in keeper.misc.

  • desc (str) – Description for progress bar.

  • profiles_desc (str, default = “t0”) – Description of profiles used in name of file to store results.

  • metric (str) – The metric used to compute the distance, passed to scipy.spatial.distance.cdist.

  • normalize (bool) – If True, normalize neighborhood profiles to sum to 1.

  • **kwargs (dict) – Extra arguments to metric, passed to scipy.spatial.distance.cdist.

Returns:

eds – Euclidean distances between pairwise observations where rows are observation-pairs and columns are node names. This is saved in keeper.misc with the key label.

Return type:

pandas.DataFrame

Notes

If object.outdir is not None, Euclidean distances are saved to file. Before starting the computation, check if the file exists. If so, load and remove already computed nodes from the iteration. Euclidean distances are computed for the remaining nodes, combined with the previously computed and saved results before saving and returning the combined results.
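
Example (illustrative sketch; how the result is retrieved from keeper.misc depends on the Keeper implementation and is not shown):

>>> eds = inet.multiple_pairwise_observation_neighborhood_euc_distance(
...     include_self=False, label="pw_obs_nbhd_euc_dist", normalize=True)
>>> # the same dataframe is also stored in keeper.misc under the given label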

multiple_pairwise_observation_neighborhood_wass_distance(nodes=None, include_self=False, graph_distances=None, label='pw_obs_nbhd_wass_dist', desc='Computing pairwise 1-hop distances', profiles_desc='t0', proc=4, chunksize=None, measure_cutoff=1e-06, solvr=None)[source]#

Compute observation-pairwise Wasserstein distances between the profiles over node neighborhood for all nodes.

Parameters:
  • nodes ({None, list [str]}) – List of nodes to compute neighborhood distances on. If None, all nodes with at least 2 neighbors are used.

  • include_self (bool) – If True, add node in neighborhood which will result in computing normalized profile over the neighborhood. If False, node is not included in neighborhood which results in computing the transition distribution over the neighborhood.

  • graph_distances (numpy.ndarray, (n, n)) – A matrix of node-pairwise graph distances between the \(n\) nodes (ordered from \(0, 1, ..., n-1\)). If None, use hop distance.

  • label (str) – Label that resulting Wasserstein distances are saved in keeper.misc.

  • desc (str) – Description for progress bar.

  • profiles_desc (str, default = 't0') – Description of profiles used in name of file to store results.

  • measure_cutoff (float) – Threshold for treating values in profiles as zero, default = 1e-6.

  • proc (int) – Number of processors used for multiprocessing. (Default value = cpu_count()).

  • chunksize (int) – Chunksize to allocate for multiprocessing.

  • solvr (str) – Solver to pass to POT library for computing Wasserstein distance.

Returns:

wds – Wasserstein distances between pairwise observations where rows are observation-pairs and columns are node names. This is saved in keeper.misc with the key label.

Return type:

pandas.DataFrame

Notes

If object.outdir is not None, Wasserstein distances are saved to file every 10 iterations. Before starting the computation, check if the file exists. If so, load and remove already computed nodes from the iteration. Wasserstein distances are computed for the remaining nodes, combined with the previously computed and saved results before saving and returning the combined results.

Only nodes with at least 2 neighbors are included, as leaf nodes will all have the same Wasserstein distance and do not provide any further information.

Todo: specify whether nodes in the input are ids or node names, and check that loaded data has the correct type (int or str) for nodes.

neighborhood(node, include_self=False)[source]#

Return the neighborhood of the node.

Parameters:
  • node (int) – Node of interest.

  • include_self (bool) – If True, include node in its neighborhood.

Returns:

neighborhood – List of nodes in the neighborhood of node.

Return type:

list

neighborhood_profiles(node, include_self=False)[source]#

Return the profiles over the neighborhood of the node.

Parameters:
  • node (int) – Node in the graph to compute the neighborhood on.

  • include_self (bool) – If True, add node in neighborhood which will result in computing normalized profile over the neighborhood. If False, node is not included in neighborhood which results in computing the transition distribution over the neighborhood.

Returns:

  • neighborhood (list) – List of nodes in the neighborhood of node.

  • sub_profiles (pandas.DataFrame) – The profiles over the node neighborhood.
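
Example (illustrative sketch; node 5 is an arbitrary node index, and the two documented outputs are assumed to be returned as a tuple):

>>> nbhd = inet.neighborhood(5, include_self=True)        # node 5 and its neighbors
>>> nbhd, sub_profiles = inet.neighborhood_profiles(5)    # profiles restricted to the neighborhood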

pairwise_observation_neighborhood_euc_distance(node, include_self=False, metric='euclidean', normalize=False, **kwargs)[source]#

Compute observation-pairwise Euclidean distances between the profiles over node neighborhood.

Parameters:
  • profiles (pandas.DataFrame (n_features, n_observations)) – Profiles that Euclidean distance is computed between.

  • node (int) – Node in the graph to compute the neighborhood on.

  • include_self (bool) – If True, include node in neighborhood. If False, node is not included in neighborhood.

  • metric (str) – The metric used to compute the distance, passed to scipy.spatial.distance.cdist.

  • normalize (bool) – If True, normalize neighborhood profiles to sum to 1.

  • **kwargs (dict) – Extra arguments to metric, passed to scipy.spatial.distance.cdist.

Returns:

ed – Euclidean distances between pairwise observations

Return type:

pandas.Series

pairwise_observation_neighborhood_wass_distance(node, include_self=False, graph_distances=None, proc=4, chunksize=None, measure_cutoff=1e-06, solvr=None)[source]#

Compute observation-pairwise Wasserstein distances between the profiles over node neighborhood.

Parameters:
  • node (int) – Node in the graph to compute the neighborhood on.

  • include_self (bool) – If True, add node in neighborhood which will result in computing normalized profile over the neighborhood. If False, node is not included in neighborhood which results in computing the transition distribution over the neighborhood.

  • graph_distances (numpy.ndarray, (n, n)) – A matrix of node-pairwise graph distances between the n nodes (ordered from \(0, 1, ..., n-1\)). If None, use hop distance.

  • measure_cutoff (float) – Threshold for treating values in profiles as zero, default = 1e-6.

  • proc (int) – Number of processors used for multiprocessing. (Default value = cpu_count()).

  • chunksize (int) – Chunksize to allocate for multiprocessing.

  • solvr (str) – Solver to pass to POT library for computing Wasserstein distance.

Returns:

wd – Wasserstein distances between pairwise observations

Return type:

pandas.Series

pairwise_observation_profile_euc_distance(features=None, label='euc_dist_observation_pairwise_profiles_t0', desc='Computing pairwise profile Euclidean distances', metric='euclidean', normalize=False, **kwargs)[source]#

Compute observation-pairwise Euclidean distances between the profiles over selected features.

Parameters:
  • features ({None, list, [int]}) – List of features to compute profile distances on. If None, all features are used.

  • label (str) – Label that resulting Euclidean distances are saved in keeper.distances and name of file to store stacked results.

  • desc (str) – Description for progress bar.

  • metric (str) – The metric used to compute the distance, passed to scipy.spatial.distance.cdist.

  • normalize (bool) – If True, normalize neighborhood profiles to sum to 1.

  • **kwargs (dict) – Extra arguments to metric, passed to scipy.spatial.distance.cdist.

Returns:

eds – Euclidean distances between pairwise observations where rows are observation-pairs and columns are node names. This is saved in keeper.distances with the key label.

Return type:

pandas.DataFrame

Notes

If object.outdir is not None, Euclidean distances are saved to file. Before starting the computation, check if the file exists. If so, load and remove already computed nodes from the iteration. Euclidean distances are computed for the remaining nodes, combined with the previously computed and saved results before saving and returning the combined results.

Results are saved in keeper.distances.
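
Example (illustrative sketch; "cityblock" is just one of the metrics accepted by scipy.spatial.distance.cdist, and the label in the second call is arbitrary):

>>> # Euclidean distances over all features (the default metric)
>>> eds = inet.pairwise_observation_profile_euc_distance()
>>> # a different cdist metric on normalized profiles
>>> eds_l1 = inet.pairwise_observation_profile_euc_distance(
...     metric="cityblock", normalize=True,
...     label="l1_dist_observation_pairwise_profiles_t0")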

pairwise_observation_profile_wass_distance(features=None, graph_distances=None, label='wass_dist_observation_pairwise_profiles_t0', desc='Computing pairwise profile Wasserstein distances', proc=4, chunksize=None, measure_cutoff=1e-06, solvr=None)[source]#

Compute observation-pairwise Wasserstein distances between the profiles over selected features.

Parameters:
  • features ({None, list [str]}) – List of features to compute profile distances on. If None, all features are used.

  • graph_distances (numpy.ndarray, (n, n)) – A matrix of node-pairwise graph distances between the \(n\) nodes (ordered from \(0, 1, ..., n-1\)). If None, use hop distance.

  • label (str) – Label that resulting Wasserstein distances are saved in keeper.distances and name of file to store stacked results.

  • desc (str) – Description for progress bar.

  • measure_cutoff (float) – Threshold for treating values in profiles as zero, default = 1e-6.

  • proc (int) – Number of processors used for multiprocessing. (Default value = cpu_count()).

  • chunksize (int) – Chunksize to allocate for multiprocessing.

  • solvr (str) – Solver to pass to POT library for computing Wasserstein distance.

Returns:

wds – Wasserstein distances between pairwise profiles where rows are observation-pairs and columns are node names. This is saved in keeper.distances with the key label.

Return type:

pandas.DataFrame

Notes

If object.outdir is not None, Wasserstein distances are saved to file every 10 iterations. Before starting the computation, check if the file exists. If so, load and remove already computed nodes from the iteration. Wasserstein distances are computed for the remaining nodes, combined with the previously computed and saved results before saving and returning the combined results.

Only nodes with at least 2 neighbors are included, as leaf nodes will all have the same Wasserstein distance and do not provide any further information.

Todo: specify whether nodes in the input are ids or node names, and check that loaded data has the correct type (int or str) for nodes.

Results are saved in keeper.distances.

plot_profiles(profiles, ylog=False, ax=None, figsize=(5.3, 4), title='', lw=1.3, marker_size=2, **plot_kwargs)[source]#

Plot profiles, where rows are times and columns are features.

spearmanr(data, **kwargs)[source]#

Calculate a Spearman correlation coefficient with associated p-value using scipy.stats.spearmanr.

Parameters:
  • data (numpy.ndarray, (n_observations, n_features)) – 2-D array containing multiple variables and observations.

  • **kwargs (dict) – Optional key-word arguments passed to scipy.stats.spearmanr.

Returns:

  • R (pandas.DataFrame) – Spearman correlation matrix. The correlation matrix is square with length equal to total number of variables (columns or rows).

  • pvalue (float) – The p-value for a hypothesis test whose null hypothesis is that two sets of data are uncorrelated. See the documentation for scipy.stats.spearmanr for alternative hypotheses. pvalue has the same shape as R.

stack_triu(df, name=None)[source]#

Stack the upper triangular entries of the dataframe above the diagonal

Note, this is useful for symmetric dataframes like correlations or distances.

Parameters:
  • df (pandas DataFrame) – Dataframe to stack. Note, upper triangular entries are taken from df as provided, with no check that the rows and columns are symmetric.

  • name (str) – Optional name of pandas Series output df_stacked.

Returns:

df_stacked – The stacked upper triangular entries above the diagonal of the dataframe.

Return type:

pandas Series
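
Example (a small self-contained sketch with a 3 x 3 symmetric dataframe; the expected entries are listed in the comment):

>>> import pandas as pd
>>> sym = pd.DataFrame([[0.0, 1.0, 2.0],
...                     [1.0, 0.0, 3.0],
...                     [2.0, 3.0, 0.0]],
...                    index=["a", "b", "c"], columns=["a", "b", "c"])
>>> stacked = inet.stack_triu(sym, name="dist")
>>> # expected entries above the diagonal: (a, b) -> 1.0, (a, c) -> 2.0, (b, c) -> 3.0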

stack_triu_where(df, condition, name=None)[source]#

Stack the upper triangular entries of the dataframe above the diagonal where the condition is True. Note, this is useful for symmetric dataframes like correlations or distances.

Parameters:
  • df (pandas DataFrame) – Dataframe to stack. Note, upper triangular entries are taken from df as provided, with no check that the rows and columns are symmetric.

  • condition (pandas DataFrame) – Boolean dataframe of the same size and order of rows and columns as df indicating values, where True, to include in the stacked dataframe.

  • name (str, optional) – Name of pandas Series output df_stacked.

Returns:

df_stacked – The stacked upper triangular entries above the diagonal of the dataframe, where condition is True.

Return type:

pandas Series

value_counter(values)[source]#

Return a dictionary with the number of times each value appears.

Parameters:

values (iterable) – List of values.

Returns:

counter – Dictionary of the form {value : count} with the number of times each value appears in the iterable.

Return type:

defaultdict [value, int]
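
Example (illustrative sketch):

>>> counts = inet.value_counter(["x", "y", "x", "x"])
>>> counts["x"]   # 3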

weighted_observation_network(observation, weight='weight', data=None, **kwargs)[source]#

Return the graph with node weights taken from the observation's feature values.

Parameters:
  • observation (str) – Name of observation to use.

  • weight (str) – Name of node attribute that the observation feature is saved to in the returned graph, default = ‘weight’

  • data ({None, str, pandas.DataFrame}) –

    Specify which data the observation weights should be taken from.

    data options:

    • None : The original data is used.

    • str : data is expected to be a key in self.meta and the observation weights are taken from the data in the corresponding dict-value.

    • pandas.DataFrame : The dataframe is used to select the observation weights, where the observation is expected to be one of the columns and the rows are expected to be labelled by the node indices, \(0, 1, ..., n-1\) where \(n\) is the number of nodes in the graph.

  • kwargs (dict, optional) – Specify arguments passed to computing edge weights.

Returns:

G – The observation-specific node-weighted graph.

Return type:

networkx.Graph
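
Example (illustrative sketch; "sample_1" is an assumed observation name):

>>> import networkx as nx
>>> G = inet.weighted_observation_network("sample_1", weight="weight")
>>> node_weights = nx.get_node_attributes(G, "weight")   # observation values attached to nodes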