netflow.methods.classes#
Classes to compute distances.
Classes
InfoNet(keeper, graph_key, layer[, verbose]) – A class to compute information flow on a network and correlation between network modules
- class netflow.methods.classes.InfoNet(keeper, graph_key, layer, verbose=None)[source]#
A class to compute information flow on a network and correlation between network modules
- Parameters:
keeper (netflow.Keeper) – The keeper object that stores the data of size (n_features, n_observations).
graph_key (str) – The key to the graph in the graph keeper that should be used. (The graph does not have to include all features in the data.)
layer (str) – The key to the data in the data keeper that should be used.
- anisotropic_laplacian_matrix(observation=None, use_spectral_gap=False, data=None)[source]#
Return the transpose of the anisotropic random-walk Laplacian matrix.
- Parameters:
observation ({None, str}) – If provided, use the observation-weighted graph to construct the Laplacian. If None, the Laplacian is constructed from the unweighted graph, treated as having uniform edge weights equal to 1.
use_spectral_gap (bool) – Option to use spectral gap.
data ({None, str, pandas.DataFrame}) – Specify which data the observation profile should be taken from. Note, this is ignored if observation is None. data options:
- None : The original data is used.
- str : data is expected to be a key in self.meta and the observation profile is taken from the data in the corresponding dict-value.
- pandas.DataFrame : The dataframe is used to select the observation profile, where columns are observations and the rows are expected to be labelled by the node indices, \(0, 1, ..., n-1\) where \(n\) is the number of nodes in the graph.
- Returns:
Transpose of the random-walk graph Laplacian matrix.
- Return type:
Lrw
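As a point of reference, below is a minimal NumPy/networkx sketch of an ordinary random-walk Laplacian on an unweighted graph. The anisotropic, observation-weighted construction and the optional spectral-gap scaling performed by this method are specific to netflow and are not reproduced here; the toy graph is a placeholder.

```python
import networkx as nx
import numpy as np

# Sketch of a random-walk graph Laplacian on an unweighted graph
# (uniform edge weights of 1); not the anisotropic variant used by netflow.
G = nx.path_graph(4)                              # nodes labelled 0..3
A = nx.to_numpy_array(G)                          # adjacency matrix
D_inv = np.diag(1.0 / A.sum(axis=1))              # inverse degree matrix
L_rw = np.eye(A.shape[0]) - D_inv @ A             # random-walk Laplacian
L_rw_T = L_rw.T                                   # the method returns the transpose
```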
- compute_graph_distances(G=None, weight='weight')[source]#
Compute node-pairwise graph distances.
- Parameters:
G (networkx.Graph) – The graph with nodes assumed to be labeled consecutively from \(0, 1, ..., n-1\) where \(n\) is the number of nodes.
weight (str, optional) – Edge attribute of weights used for computing the weighted hop distance. If None, compute the unweighted distance. That is, rather than minimizing the sum of weights over all paths between every two nodes, minimize the number of edges.
- Returns:
dist – A matrix of node-pairwise graph distances between the \(n\) nodes.
- Return type:
numpy.ndarray, (n, n)
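A sketch of the equivalent computation with networkx shortest-path lengths, assuming nodes labelled 0..n-1 and a 'weight' edge attribute (the toy graph below is a placeholder):

```python
import networkx as nx
import numpy as np

# Sketch of node-pairwise graph distances on a small weighted graph.
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 1.0), (1, 2, 0.5), (2, 3, 2.0)], weight="weight")

n = G.number_of_nodes()
dist = np.zeros((n, n))
for src, lengths in nx.shortest_path_length(G, weight="weight"):   # weighted distances
    for dst, d in lengths.items():
        dist[src, dst] = d
# Passing weight=None instead would minimize the number of edges (unweighted hops).
```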
- diffuse_multiple_profiles(observations=None, times=None, t_min=-1.5, t_max=2.5, n_t=10, log_time=True, laplacian=None, use_spectral_gap=False, do_plot=False, **plot_kwargs)[source]#
Diffuse multiple observation profiles from the original data.
- Parameters:
observations ({None, list}) – Observations to iterate over. If None, use all observations in data.
times ({None, array}) – Array of times at which to evaluate the diffusion simulation. Note, if given, t_min, t_max and n_t are ignored.
t_min (float) – First time point at which to evaluate the diffusion simulation. Note, t_min is ignored if times is not None.
t_max (float) – Last time point at which to evaluate the diffusion simulation. Note, t_max must be greater than t_min, i.e., t_max > t_min. Note, t_max is ignored if times is not None.
n_t (int) – Number of time points to generate. Note, n_t is ignored if times is not None.
log_time (bool) – If True, return n_t numbers spaced evenly on a log scale, where the time sequence starts at 10 ** t_min, ends with 10 ** t_max, and the sequence of times is of the form 10 ** t where t is the n_t evenly spaced points between (and including) t_min and t_max. For example, _get_times(t_min=1, t_max=3, n_t=3, log_time=True) = array([10 ** 1, 10 ** 2, 10 ** 3]) = array([10., 100., 1000.]). If False, return n_t numbers evenly spaced on a linear scale, where the sequence starts at t_min and ends with t_max. For example, _get_times(t_min=1, t_max=3, n_t=3, log_time=False) = array([1., 2., 3.]). (See the NumPy sketch after this method's notes.)
graph_distances (numpy.ndarray, (n, n)) – A matrix of node-pairwise graph distances between the \(n\) nodes ordered by the rows in object.profiles.
laplacian (numpy.ndarray, (n, n)) – The transpose of the graph Laplacian matrix, with the \(n\) rows and columns ordered by the rows in object.profiles. If None, the observation-specific Laplacian is used.
use_spectral_gap (bool) – Option to use spectral gap. Note, this is ignored if laplacian is provided and not None.
filename (str) – If not None, save results.
do_plot (bool) – If True, plot diffused profiles for each observation.
**plot_kwargs (dict) – Key-word arguments passed to plot_profiles (should not include title).
Notes
Side-effects :
- Saves a pandas DataFrame of the diffused profiles, where each row is a time and each column is a feature name, for each observation to the file self.outdir / ‘diffused_profiles’ / ‘diffused_profile_{observation}.csv’.
- If do_plot is True, plots the diffused profile for each observation.
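The documented time grids correspond to the following NumPy constructions (a sketch; _get_times is a private helper, so its exact implementation may differ):

```python
import numpy as np

# Equivalent NumPy constructions of the documented time grids.
t_min, t_max, n_t = -1.5, 2.5, 10

times_log = np.logspace(t_min, t_max, n_t)   # log_time=True: 10**t_min ... 10**t_max
times_lin = np.linspace(t_min, t_max, n_t)   # log_time=False: t_min ... t_max

# e.g. np.logspace(1, 3, 3) -> array([  10.,  100., 1000.])
#      np.linspace(1, 3, 3) -> array([1., 2., 3.])
```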
- diffuse_profile(observation, times=None, t_min=-1.5, t_max=2.5, n_t=10, log_time=True, laplacian=None, do_save=True)[source]#
Diffuse a single observation profile from the original data.
- Parameters:
observation (str) – Observation profile to use.
times ({None, array}) – Array of times at which to evaluate the diffusion simulation. Note, if given, t_min, t_max and n_t are ignored.
t_min (float) – First time point at which to evaluate the diffusion simulation. Note, t_min is ignored if times is not None.
t_max (float) – Last time point at which to evaluate the diffusion simulation. Note, t_max must be greater than t_min, i.e., t_max > t_min. Note, t_max is ignored if times is not None.
n_t (int) – Number of time points to generate. Note, n_t is ignored if times is not None.
log_time (bool) – If True, return n_t numbers spaced evenly on a log scale, where the time sequence starts at 10 ** t_min, ends with 10 ** t_max, and the sequence of times is of the form 10 ** t where t is the n_t evenly spaced points between (and including) t_min and t_max. For example, _get_times(t_min=1, t_max=3, n_t=3, log_time=True) = array([10 ** 1, 10 ** 2, 10 ** 3]) = array([10., 100., 1000.]). If False, return n_t numbers evenly spaced on a linear scale, where the sequence starts at t_min and ends with t_max. For example, _get_times(t_min=1, t_max=3, n_t=3, log_time=False) = array([1., 2., 3.]).
graph_distances (numpy.ndarray, (n, n)) – A matrix of node-pairwise graph distances between the \(n\) nodes ordered by the rows in object.profiles.
laplacian (numpy.ndarray, (n, n)) – The transpose of the graph Laplacian matrix, with the \(n\) rows and columns ordered by the rows in object.profiles.
filename (str) – If not None, save results.
do_save (bool) – If True, save the diffused profile to self.outdir / ‘diffused_profiles’ / ‘diffused_profile_{observation}.csv’.
- Returns:
profiles – Diffused profiles where each row is a time and each column is a feature name.
- Return type:
pandas.DataFrame
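Conceptually, diffusing a profile amounts to evolving the heat equation du/dt = -L u on the graph, i.e. u(t) = expm(-t L) u(0). Below is a standalone sketch with an ordinary combinatorial Laplacian; whether netflow uses this Laplacian, its anisotropic random-walk variant, or the transpose follows the library's own construction and is not asserted here.

```python
import numpy as np
import networkx as nx
from scipy.linalg import expm

# Conceptual sketch of heat diffusion of a node profile on a graph.
G = nx.path_graph(5)
L = nx.laplacian_matrix(G).toarray().astype(float)

u0 = np.array([1.0, 0.0, 0.0, 0.0, 0.0])                   # profile concentrated on node 0
times = np.logspace(-1.5, 2.5, 10)
diffused = np.vstack([expm(-t * L) @ u0 for t in times])   # rows are times, columns are nodes
```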
- dispersion(data, axis=0)[source]#
Data dispersion computed as the absolute value of the variance-to-mean ratio, where the variance and mean are computed on the values over the requested axis.
- Parameters:
data (pandas.DataFrame) – Data used to compute dispersion.
axis ({0, 1}) –
Axis along which the variance and mean are computed.
Options:
0 : for each column, apply function to the values over the index
1 : for each index, apply function to the values over the columns
- Returns:
vmr – Variance-to-mean ratio (vmr) quantifying the dispersion.
- Return type:
pandas.Series
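A minimal pandas sketch of the variance-to-mean ratio along axis=0 (pandas' default ddof=1 is assumed here; the ddof used by the library is not documented on this page):

```python
import pandas as pd

# Sketch of the variance-to-mean ratio (dispersion) along axis=0: for each
# column, the absolute value of its variance divided by its mean.
data = pd.DataFrame({"f1": [1.0, 2.0, 3.0], "f2": [10.0, 10.0, 10.0]})
vmr = (data.var(axis=0) / data.mean(axis=0)).abs()
# f1 -> 0.5 (variance 1.0 over mean 2.0); f2 -> 0.0 (no spread).
```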
- invariant_measures(label='IM')[source]#
Compute the invariant measure for each observation using data as node weights.
- Parameters:
label (str) – Label of key used to store invariant measures in the data keeper.
- Returns:
Makes the invariant measures attribute, of size (n_observations, n_observations), available in keeper.data[label].
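For orientation only: for a simple random walk on a node-weighted graph, the invariant (stationary) measure is proportional to each node's weighted degree. The sketch below illustrates that general notion with networkx; the exact weighting that invariant_measures derives from the observation data is not reproduced here.

```python
import networkx as nx
import numpy as np

# Generic stationary distribution of a random walk on a weighted graph:
# pi_i is proportional to the weighted degree (strength) of node i.
G = nx.path_graph(4)
strength = np.array([d for _, d in G.degree(weight="weight")], dtype=float)
pi = strength / strength.sum()
```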
- kendall_tau(data, **kwargs)[source]#
Calculate Kendall’s tau correlation coefficient with associated p-value using scipy.stats.kendalltau.
- Parameters:
data (numpy.ndarray, (n_observations, n_features)) – 2-D array containing multiple variables and observations.
**kwargs (dict) – Optional key-word arguments passed to scipy.stats.kendalltau.
- Returns:
R (pandas.DataFrame) – Kendall's tau correlation matrix. The correlation matrix is square with length equal to the total number of variables (columns or rows).
pvalue (float) – The p-value for a hypothesis test whose null hypothesis is that two sets of data are uncorrelated. See the documentation for scipy.stats.kendalltau for alternative hypotheses. pvalue has the same shape as R.
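A minimal sketch of the underlying SciPy call for a single pair of variables; the method applies this over all variable pairs to assemble the square correlation and p-value matrices. The random data below is a placeholder.

```python
import numpy as np
from scipy.stats import kendalltau

# Kendall's tau between two columns of an (n_observations, n_features) array.
rng = np.random.default_rng(0)
data = rng.normal(size=(20, 2))
tau, pvalue = kendalltau(data[:, 0], data[:, 1])
```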
- load_diffused_profile(observation)[source]#
Load an observation's diffused profile.
- Parameters:
observation (str) – Observation profile to load.
- Returns:
diffused_profiles – Diffused profiles where each row is a time and each column is a feature name.
- Return type:
pandas.DataFrame
- load_diffused_timepoint_profile(time, observations=None)[source]#
Load the diffused profiles at a given timepoint for the requested observations.
- Parameters:
time (float) – Timepoint in diffusion simulation to select.
observations ({None, list}) – List of observations to iterate over. If None, all observations in self.data are used.
- Returns:
diffused_profiles – Diffused profiles at time, for all observations, where rows are nodes and columns are observations.
- Return type:
pandas.DataFrame
- multiple_pairwise_observation_neighborhood_euc_distance(nodes=None, include_self=False, label='pw_obs_nbhd_euc_dist', desc='Computing pairwise 1-hop distances', profiles_desc='t0', metric='euclidean', normalize=False, **kwargs)[source]#
Compute observation-pairwise Euclidean distances between the profiles over node neighborhood for all nodes.
- Parameters:
nodes ({None, list, [int]}) – List of nodes to compute neighborhood distances on. If None, all nodes with at least 2 neighbors are used.
include_self (bool) – If True, include the node in its neighborhood, which results in computing the normalized profile over the neighborhood. If False, the node is not included in its neighborhood, which results in computing the transition distribution over the neighborhood.
label (str) – Label under which the resulting Euclidean distances are saved in keeper.misc.
desc (str) – Description for progress bar.
profiles_desc (str, default = “t0”) – Description of profiles used in name of file to store results.
metric (str) – The metric used to compute the distance, passed to scipy.spatial.distance.cdist.
normalize (bool) – If True, normalize neighborhood profiles to sum to 1.
**kwargs (dict) – Extra arguments to metric, passed to scipy.spatial.distance.cdist.
- Returns:
eds – Euclidean distances between pairwise observations, where rows are observation-pairs and columns are node names. This is saved in keeper.misc with the key label.
- Return type:
pandas.DataFrame
Notes
If object.outdir is not None, Euclidean distances are saved to file. Before starting the computation, the method checks if the file exists; if so, it loads the file and removes already-computed nodes from the iteration. Euclidean distances are then computed for the remaining nodes and combined with the previously computed and saved results before saving and returning the combined results.
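A standalone sketch of the per-node computation, assuming a placeholder profiles table and a hypothetical 1-hop neighborhood; the metric and **kwargs arguments map directly onto scipy.spatial.distance.cdist:

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Euclidean distances between every pair of observations, computed on the
# profile values restricted to one node's neighborhood.
profiles = pd.DataFrame(
    np.random.default_rng(0).random((5, 3)),          # (n_nodes, n_observations)
    columns=["obs_a", "obs_b", "obs_c"],               # placeholder observation names
)
neighborhood = [0, 2, 4]                               # hypothetical 1-hop neighborhood
sub = profiles.loc[neighborhood]                       # profiles over the neighborhood
# sub = sub / sub.sum(axis=0)                          # normalize=True would rescale columns to sum to 1
ed = cdist(sub.T.values, sub.T.values, metric="euclidean")   # (n_obs, n_obs) distance matrix
```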
- multiple_pairwise_observation_neighborhood_wass_distance(nodes=None, include_self=False, graph_distances=None, label='pw_obs_nbhd_wass_dist', desc='Computing pairwise 1-hop distances', profiles_desc='t0', proc=4, chunksize=None, measure_cutoff=1e-06, solvr=None)[source]#
Compute observation-pairwise Wasserstein distances between the profiles over node neighborhood for all nodes.
- Parameters:
nodes ({None, list [str]}) – List of nodes to compute neighborhood distances on. If None, all nodes with at least 2 neighbors are used.
include_self (bool) – If True, include the node in its neighborhood, which results in computing the normalized profile over the neighborhood. If False, the node is not included in its neighborhood, which results in computing the transition distribution over the neighborhood.
graph_distances (numpy.ndarray, (n, n)) – A matrix of node-pairwise graph distances between the \(n\) nodes (ordered from \(0, 1, ..., n-1\)). If None, use hop distance.
label (str) – Label under which the resulting Wasserstein distances are saved in keeper.misc.
desc (str) – Description for progress bar.
profiles_desc (str, default = 't0') – Description of profiles used in name of file to store results.
measure_cutoff (float) – Threshold for treating values in profiles as zero, default = 1e-6.
proc (int) – Number of processors used for multiprocessing (default value = cpu_count()).
chunksize (int) – Chunksize to allocate for multiprocessing.
solvr (str) – Solver to pass to POT library for computing Wasserstein distance.
- Returns:
wds – Wasserstein distances between pairwise observations, where rows are observation-pairs and columns are node names. This is saved in keeper.misc with the key label.
- Return type:
pandas.DataFrame
Notes
If object.outdir is not None, Wasserstein distances are saved to file every 10 iterations. Before starting the computation, the method checks if the file exists; if so, it loads the file and removes already-computed nodes from the iteration. Wasserstein distances are then computed for the remaining nodes and combined with the previously computed and saved results before saving and returning the combined results.
Only nodes with at least 2 neighbors are included, as leaf nodes will all have the same Wasserstein distance and do not provide any further information.
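A minimal sketch of a single observation-pair Wasserstein distance with the POT library, using graph distances as the ground cost. The profiles, neighborhood and cost matrix below are placeholders, and the solver choice (solvr) and measure_cutoff handling performed by the method are omitted.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

# Wasserstein (earth mover's) distance between two observations' profiles
# over a 3-node neighborhood, with graph distances as the ground cost.
M = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])                 # node-pairwise graph distances

a = np.array([0.7, 0.2, 0.1])                   # observation A's profile (sums to 1)
b = np.array([0.1, 0.3, 0.6])                   # observation B's profile (sums to 1)

wd = ot.emd2(a, b, M)                           # exact optimal transport cost
```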
- neighborhood(node, include_self=False)[source]#
Return the neighborhood of the node.
- Parameters:
node (int) – Node of interest.
include_self (bool) – If True, include node in its neighborhood.
- Returns:
neighborhood – List of nodes in the neighborhood of node.
- Return type:
list
- neighborhood_profiles(node, include_self=False)[source]#
Return the profiles over the neighborhood of the node.
- Parameters:
node (int) – Node in the graph to compute the neighborhood on.
include_self (bool) – If True, include the node in its neighborhood, which results in computing the normalized profile over the neighborhood. If False, the node is not included in its neighborhood, which results in computing the transition distribution over the neighborhood.
- Returns:
neighborhood (list) – List of nodes in the neighborhood of node.
sub_profiles (pandas.DataFrame) – The profiles over the node neighborhood.
- pairwise_observation_neighborhood_euc_distance(node, include_self=False, metric='euclidean', normalize=False, **kwargs)[source]#
Compute observation-pairwise Euclidean distances between the profiles over node neighborhood.
- Parameters:
profiles (pandas.DataFrame (n_features, n_observations)) – Profiles between which the Euclidean distances are computed.
node (int) – Node in the graph to compute the neighborhood on.
include_self (bool) – If True, include node in neighborhood. If False, node is not included in neighborhood.
metric (str) – The metric used to compute the distance, passed to scipy.spatial.distance.cdist.
normalize (bool) – If True, normalize neighborhood profiles to sum to 1.
**kwargs (dict) – Extra arguments to metric, passed to scipy.spatial.distance.cdist.
- Returns:
ed – Euclidean distances between pairwise observations
- Return type:
pandas.Series
- pairwise_observation_neighborhood_wass_distance(node, include_self=False, graph_distances=None, proc=4, chunksize=None, measure_cutoff=1e-06, solvr=None)[source]#
Compute observation-pairwise Wasserstein distances between the profiles over node neighborhood.
- Parameters:
node (int) – Node in the graph to compute the neighborhood on.
include_self (bool) – If True, include the node in its neighborhood, which results in computing the normalized profile over the neighborhood. If False, the node is not included in its neighborhood, which results in computing the transition distribution over the neighborhood.
graph_distances (numpy.ndarray, (n, n)) – A matrix of node-pairwise graph distances between the n nodes (ordered from \(0, 1, ..., n-1\)). If None, use hop distance.
measure_cutoff (float) – Threshold for treating values in profiles as zero, default = 1e-6.
proc (int) – Number of processors used for multiprocessing (default value = cpu_count()).
chunksize (int) – Chunksize to allocate for multiprocessing.
solvr (str) – Solver to pass to POT library for computing Wasserstein distance.
- Returns:
wd – Wasserstein distances between pairwise observations
- Return type:
pandas.Series
- pairwise_observation_profile_euc_distance(features=None, label='euc_dist_observation_pairwise_profiles_t0', desc='Computing pairwise profile Euclidean distances', metric='euclidean', normalize=False, **kwargs)[source]#
Compute observation-pairwise Euclidean distances between the profiles over selected features.
- Parameters:
features ({None, list, [int]}) – List of features to compute profile distances on. If None, all features are used.
label (str) – Label under which the resulting Euclidean distances are saved in keeper.distances; also used as the name of the file to store stacked results.
desc (str) – Description for progress bar.
metric (str) – The metric used to compute the distance, passed to scipy.spatial.distance.cdist.
normalize (bool) – If True, normalize neighborhood profiles to sum to 1.
**kwargs (dict) – Extra arguments to metric, passed to scipy.spatial.distance.cdist.
- Returns:
eds – Euclidean distances between pairwise observations, where rows are observation-pairs and columns are node names. This is saved in keeper.distances with the key label.
- Return type:
pandas.DataFrame
Notes
If object.outdir is not None, Euclidean distances are saved to file. Before starting the computation, the method checks if the file exists; if so, it loads the file and removes already-computed nodes from the iteration. Euclidean distances are then computed for the remaining nodes and combined with the previously computed and saved results before saving and returning the combined results.
- pairwise_observation_profile_wass_distance(features=None, graph_distances=None, label='wass_dist_observation_pairwise_profiles_t0', desc='Computing pairwise profile Wasserstein distances', proc=4, chunksize=None, measure_cutoff=1e-06, solvr=None)[source]#
Compute observation-pairwise Wasserstein distances between the profiles over selected features.
- Parameters:
features ({None, list [str]}) – List of features to compute profile distances on. If None, all features are used.
graph_distances (numpy.ndarray, (n, n)) – A matrix of node-pairwise graph distances between the \(n\) nodes (ordered from \(0, 1, ..., n-1\)). If None, use hop distance.
label (str) – Label under which the resulting Wasserstein distances are saved in keeper.distances; also used as the name of the file to store stacked results.
desc (str) – Description for progress bar.
measure_cutoff (float) – Threshold for treating values in profiles as zero, default = 1e-6.
proc (int) – Number of processors used for multiprocessing (default value = cpu_count()).
chunksize (int) – Chunksize to allocate for multiprocessing.
solvr (str) – Solver to pass to POT library for computing Wasserstein distance.
- Returns:
wds – Wasserstein distances between pairwise profiles, where rows are observation-pairs and columns are node names. This is saved in keeper.distances with the key label.
- Return type:
pandas.DataFrame
Notes
If object.outdir is not None, Wasserstein distances are saved to file every 10 iterations. Before starting the computation, the method checks if the file exists; if so, it loads the file and removes already-computed nodes from the iteration. Wasserstein distances are then computed for the remaining nodes and combined with the previously computed and saved results before saving and returning the combined results.
Only nodes with at least 2 neighbors are included, as leaf nodes will all have the same Wasserstein distance and do not provide any further information.
- plot_profiles(profiles, ylog=False, ax=None, figsize=(5.3, 4), title='', lw=1.3, marker_size=2, **plot_kwargs)[source]#
Plot profiles, with rows as times and columns as features.
- spearmanr(data, **kwargs)[source]#
Calculate a Spearman correlation coefficient with associated p-value using scipy.stats.spearmanr.
- Parameters:
data (numpy.ndarray, (n_observations, n_features)) – 2-D array containing multiple variables and observations.
**kwargs (dict) – Optional key-word arguments passed to scipy.stats.spearmanr.
- Returns:
R (pandas.DataFrame) – Spearman correlation matrix. The correlation matrix is square with length equal to the total number of variables (columns or rows).
pvalue (float) – The p-value for a hypothesis test whose null hypothesis is that two sets of data are uncorrelated. See the documentation for scipy.stats.spearmanr for alternative hypotheses. pvalue has the same shape as R.
- stack_triu(df, name=None)[source]#
Stack the upper triangular entries of the dataframe above the diagonal.
Note, this is useful for symmetric dataframes like correlations or distances.
- Parameters:
df (pandas DataFrame) – Dataframe to stack. Note, upper triangular entries are taken from df as provided, with no check that the rows and columns are symmetric.
name (str) – Optional name of the pandas Series output df_stacked.
- Returns:
df_stacked – The stacked upper triangular entries above the diagonal of the dataframe.
- Return type:
pandas Series
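An equivalent pandas/NumPy sketch of stacking the strictly upper-triangular entries of a symmetric distance matrix (the example frame and the Series name "distance" are placeholders):

```python
import numpy as np
import pandas as pd

# Stack the entries strictly above the diagonal into a Series indexed by (row, column) pairs.
df = pd.DataFrame([[0.0, 1.0, 2.0],
                   [1.0, 0.0, 3.0],
                   [2.0, 3.0, 0.0]],
                  index=["a", "b", "c"], columns=["a", "b", "c"])

rows, cols = np.triu_indices_from(df.values, k=1)       # indices strictly above the diagonal
df_stacked = pd.Series(
    df.values[rows, cols],
    index=pd.MultiIndex.from_arrays([df.index[rows], df.columns[cols]]),
    name="distance",                                     # placeholder for the `name` argument
)
# df_stacked has entries for the pairs (a, b), (a, c) and (b, c).
```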
- stack_triu_where(df, condition, name=None)[source]#
Stack the upper triangular entries of the dataframe above the diagonal where the condition is True. Note, this is useful for symmetric dataframes like correlations or distances.
- Parameters:
df (pandas DataFrame) – Dataframe to stack. Note, upper triangular entries are taken from df as provided, with no check that the rows and columns are symmetric.
condition (pandas DataFrame) – Boolean dataframe of the same size and order of rows and columns as df, indicating values, where True, to include in the stacked dataframe.
name (str, optional) – Name of the pandas Series output df_stacked.
- Returns:
df_stacked – The stacked upper triangular entries above the diagonal of the dataframe, where condition is True.
- Return type:
pandas Series
- value_counter(values)[source]#
Return a dictionary with the number of times each value appears.
- Parameters:
values (iterable) – List of values.
- Returns:
counter – Dictionary of the form {value : count} with the number of times each value appears in the iterable.
- Return type:
defaultdict [value, int]
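An illustrative equivalent using collections.defaultdict, matching the documented {value : count} return shape (a sketch, not the library's source):

```python
from collections import defaultdict

# Count occurrences of each value in an iterable.
def value_counter(values):
    counter = defaultdict(int)
    for v in values:
        counter[v] += 1
    return counter

value_counter(["a", "b", "a", "c", "a"])   # defaultdict(int, {'a': 3, 'b': 1, 'c': 1})
```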
- weighted_observation_network(observation, weight='weight', data=None, **kwargs)[source]#
Return the graph with nodes weighted by the observation's feature values.
- Parameters:
observation (str) – Name of observation to use.
weight (str) – Name of node attribute that the observation feature is saved to in the returned graph, default = ‘weight’
data ({None, str, pandas.DataFrame}) – Specify which data the observation weights should be taken from. data options:
- None : The original data is used.
- str : data is expected to be a key in self.meta and the observation weights are taken from the data in the corresponding dict-value.
- pandas.DataFrame : The dataframe is used to select the observation weights, where the observation is expected to be one of the columns and the rows are expected to be labelled by the node indices, \(0, 1, ..., n-1\) where \(n\) is the number of nodes in the graph.
kwargs (dict, optional) – Keyword arguments passed to the edge-weight computation.
- Returns:
G – The observation-specific node-weighted graph.
- Return type:
networkx graph
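A sketch of attaching one observation's values to the graph nodes as the 'weight' attribute, mirroring the documented pandas.DataFrame option (the graph, column name, and values are placeholders; the edge-weight computation driven by kwargs is netflow-specific and omitted):

```python
import networkx as nx
import pandas as pd

# Attach one observation's feature values to the nodes as a node attribute.
G = nx.path_graph(4)
data = pd.DataFrame({"obs_a": [0.1, 0.4, 0.3, 0.2]}, index=range(4))   # rows indexed by node ids

nx.set_node_attributes(G, data["obs_a"].to_dict(), name="weight")
# Deriving edge weights from these node weights is handled by the library and not shown here.
```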