netflow.keepers.keeper#

keeper#

Classes used for data storage and manipulation.

Classes

DataKeeper([data, observation_labels])

A class to store and handle multiple data sets.

DataView([dkeeper, label])

A class to extract a single data view from a DataKeeper for analysis.

DistanceKeeper([data, observation_labels])

A class to store and handle multiple distances.

DistanceView([dkeeper, label])

A class to extract a single distance matrix view from a DistanceKeeper for analysis.

GraphKeeper([graphs])

A class to store and handle multiple networks.

Keeper([data, distances, similarities, ...])

A class to store and handle data, distances and similarities between data points (or observations), and miscellaneous related results.

class netflow.keepers.keeper.DataKeeper(data=None, observation_labels=None)[source]#

A class to store and handle multiple data sets.

Parameters:
  • data ({numpy.ndarray, pandas.DataFrame, dict [str, numpy.ndarray], dict [str, pandas.DataFrame]}) –

    One or multiple feature data sets, where each data set is size (num_features, num_observations).

    Feature data set(s) may be provided in multiple ways:

    • numpy.ndarray : A single feature data set.

      • Observation labels may be specified in observation_labels.

      • To include feature labels, data should be a pandas.DataFrame.

      • The array data is placed in a dict with default label of the form {'data' : data}.

    • pandas.DataFrame : A single feature data set.

      • The index indicates feature labels.

      • Columns indicate observation labels.

• If observation_labels is provided, it should match the columns of the dataframe, which will be ordered according to observation_labels.

      • The array data.values is placed in a dict with default label of the form {'data' : data.values}.

• dict [str, numpy.ndarray] : A single or multiple feature data set(s) may be provided as the value(s) of a dict keyed by a str that serves as the feature data descriptor or label for each input of the form {'data_label' : `numpy.ndarray`}.

      • All arrays are expected to have the same number of columns corresponding to the number of observations with the same ordering.

      • Observation labels may be specified in observation_labels.

• To include feature labels, pass a dict of pandas.DataFrames instead.

• dict [str, pandas.DataFrame] : A single or multiple feature data set(s) may be provided as the value(s) of a dict keyed by a str that serves as the feature data descriptor or label for each input of the form {'data_label' : `pandas.DataFrame`}.

      • All dataframes are expected to have the same number of columns corresponding to the number of observations with the same columns provided in the same order.

      • The index of each input indicates feature labels.

      • Columns indicate observation labels.

• If observation_labels is provided, it should match the columns of the dataframe(s), which will be ordered according to observation_labels.

  • observation_labels (list [str], optional) –

    List of labels for each observation with length equal to num_observations. If provided when data is a pandas.DataFrame or dict [str, pandas.DataFrame], it should match the columns of the dataframe(s), which will be ordered according to observation_labels.

    If not provided and data is a :

    • pandas.DataFrame or dict [str, pandas.DataFrame], then the columns of the (first) dataframe are used.

• numpy.ndarray or dict [str, numpy.ndarray], then the observations (columns) are labeled by their positional index \(0, 1, ..., num\_observations - 1\).

Notes

All data sets are assumed to contain the same set of data points in the same order.
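Example (a minimal sketch based only on the constructor documented above; the data values and labels are illustrative):

>>> import numpy as np
>>> import pandas as pd
>>> from netflow.keepers.keeper import DataKeeper
>>> # two data sets sharing the same three observations (columns)
>>> expr = pd.DataFrame(np.random.rand(4, 3),
...                     index=[f"gene{i}" for i in range(4)],
...                     columns=["s1", "s2", "s3"])
>>> clinical = pd.DataFrame(np.random.rand(2, 3),
...                         index=["age", "weight"],
...                         columns=["s1", "s2", "s3"])
>>> dk = DataKeeper(data={"expression": expr, "clinical": clinical})
>>> n_obs = dk.num_observations      # 3
>>> labels = list(dk.keys())         # reference labels: 'expression', 'clinical'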

__contains__(key)[source]#
__getitem__(key)[source]#
__init__(data=None, observation_labels=None)[source]#
__iter__()[source]#
__module__ = 'netflow.keepers.keeper'#
add_data(data, label)[source]#

Add a feature data set to the keeper.

Note

Observation labels may be set after object initialization. This may require upkeep in Keeper to update and coordinate observation labels among the different keepers.

Parameters:
  • data ({numpy.ndarray, pandas.DataFrame}) – The data set of size (num_features, num_observations).

  • label (str) – Reference label describing the data set.

property data#

A dictionary of all data sets.

property features_labels#

A dictionary of feature labels for each data set.

items()[source]#
keys()[source]#
property num_features#

A dictionary with the number of features in each data set.

property num_observations#

Number of observations.

observation_index(observation_label)[source]#

Return index of observation.

Parameters:

observation_label (str) – The observation label.

Returns:

observation_index – Index of observation in list of observation labels.

Return type:

int

property observation_labels#

Labels for each observation.

standardize(key, **kwargs)[source]#

Standardize features by removing the mean and scaling to unit variance

Parameters:
  • key (str) – The reference key of the data in the data-keeper that will be standardized.

• kwargs – Keyword arguments passed to sklearn.preprocessing.StandardScaler.

Returns:

data_z – The standardized data.

Return type:

pandas.DataFrame

subset(observations)[source]#

Return a new instance of DataKeeper restricted to subset of observations

Parameters:

observations (list,) –

List of observations to include in the subset. This is treated differently depending on the type of observation labels :

  • If self._observation_labels is List [str], observations can be of the form:

    • List [str], to reference observations by their str label or;

    • List [int], to reference observations by their location index.

  • If self._observation_labels is None or List [int], observations must be of the form List [int], to reference observations by their location index.

Returns:

data_subset – A DataKeeper object restricted to the selected observations.

Return type:

DataKeeper
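Example (a hedged sketch; assumes a DataKeeper built from a labeled DataFrame as documented above):

>>> import numpy as np
>>> import pandas as pd
>>> from netflow.keepers.keeper import DataKeeper
>>> df = pd.DataFrame(np.arange(12).reshape(4, 3),
...                   index=list("abcd"), columns=["s1", "s2", "s3"])
>>> dk = DataKeeper(data=df)
>>> dk_by_label = dk.subset(["s1", "s3"])   # reference observations by str label
>>> dk_by_index = dk.subset([0, 2])         # or by positional index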

class netflow.keepers.keeper.DataView(dkeeper=None, label=None)[source]#

A class to extract a single data view from a DataKeeper for analysis.

Parameters:
  • dkeeper (DataKeeper) – The object to extract data from.

  • label (str) – The identifier of the data to be extracted from dkeeper.
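Example (a minimal sketch; assumes a DataKeeper populated as in the class documentation above):

>>> import numpy as np
>>> import pandas as pd
>>> from netflow.keepers.keeper import DataKeeper, DataView
>>> df = pd.DataFrame(np.random.rand(4, 3),
...                   index=list("wxyz"), columns=["s1", "s2", "s3"])
>>> dk = DataKeeper(data={"expression": df})
>>> view = DataView(dkeeper=dk, label="expression")
>>> frame = view.to_frame()   # the single data set as a pandas.DataFrame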

__init__(dkeeper=None, label=None)[source]#
__module__ = 'netflow.keepers.keeper'#
property data#

The data set.

feature_index(feature_label)[source]#

Return index of feature.

Parameters:

feature_label (str) – The feature label.

Returns:

feature_index – Index of feature in list of feature labels.

Return type:

int

property feature_labels#

Feature labels.

property label#

The data label.

property num_features#

The number of features in the data set.

property num_observations#

Number of observations.

observation_index(observation_label)[source]#

Return index of observation.

Parameters:

observation_label (str) – The observation label.

Returns:

observation_index – Index of observation in list of observation labels.

Return type:

int

property observation_labels#

Labels for each observation.

standardize(**kwargs)[source]#

Standardize features by removing the mean and scaling to unit variance.

Parameters:

kwargs – Keyword arguments passed to sklearn.preprocessing.StandardScaler.

Returns:

data_z – The standardized data.

Return type:

pandas.DataFrame

subset(observations=None, features=None)[source]#

Return data for specified subset of observations and/or features.

Parameters:
  • observations (list,) –

    List of observations to include in the subset. This is treated differently depending on the type of observation labels :

    • If self._observation_labels is List [str], observations can be of the form:

      • List [str], to reference observations by their str label or;

      • List [int], to reference observations by their location index.

    • If self._observation_labels is None or List [int], observations must be of the form List [int], to reference observations by their location index.

  • features (list) – List of features to include in the subset. The list type depends on the type of self._feature_labels, analogous to self._observation_labels for observations above.

Returns:

data – The data subset where the \(ij^{th}\) entry is the value of the \(i^{th}\) feature in features for the \(j^{th}\) observation in observations.

Return type:

pandas.DataFrame, (len(features), len(observations))

to_frame()[source]#

Return data as a pandas DataFrame

class netflow.keepers.keeper.DistanceKeeper(data=None, observation_labels=None)[source]#

A class to store and handle multiple distances.

This class may also be used to store and handle similarity matrices. Distance may be interchanged with similarity, but distance is used for simplicity.

Parameters:
  • data ({numpy.ndarray, pandas.DataFrame, dict [str, numpy.ndarray], dict [str, pandas.DataFrame]}) –

    One or multiple symmetric distance(s) between observations, where each distance matrix is size (num_observations, num_observations).

    Distance(s) may be provided in multiple ways:

    • numpy.ndarray : A single distance matrix.

      • Observation labels may be specified in observation_labels.

• To include observation labels in the matrix itself, data should be a pandas.DataFrame.

      • The array data is placed in a dict with default label of the form {'distance' : data}.

    • pandas.DataFrame : A single distance matrix.

      • The index should be the same as the columns, which indicate observation labels.

• If observation_labels is provided, it should match the rows and columns of the dataframe, which will be ordered according to observation_labels.

      • The array data.values is placed in a dict with default label of the form {'distance' : data.values}.

• dict [str, numpy.ndarray] : A single or multiple distance(s) may be provided as the value(s) of a dict keyed by a str that serves as the distance descriptor or label for each input of the form {'distance_label' : `numpy.ndarray`}.

      • All arrays are expected to have the same number of columns and rows corresponding to the number of observations with the same ordering.

      • Observation labels may be specified in observation_labels.

• dict [str, pandas.DataFrame] : A single or multiple distance(s) may be provided as the value(s) of a dict keyed by a str that serves as the distance descriptor or label for each input of the form {'distance_label' : `pandas.DataFrame`}.

      • All dataframes are expected to have the same number of rows and columns corresponding to the number of observations with the same columns provided in the same order.

      • The index and columns indicate observation labels.

• If observation_labels is provided, it should match the columns of the dataframe(s), which will be ordered according to observation_labels.

  • observation_labels (list [str]) –

    List of labels for each observation with length equal to num_observations. If provided when data is a pandas.DataFrame or dict [str, pandas.DataFrame], it should match the columns of the dataframe(s), which will be ordered according to observation_labels.

    If not provided and data is a :

    • pandas.DataFrame or dict [str, pandas.DataFrame], then the columns of the (first) dataframe are used.

• numpy.ndarray or dict [str, numpy.ndarray], then the observations (columns) are labeled by their positional index \(0, 1, ..., num\_observations - 1\).

Notes

All sets are assumed to contain the same set of data points in the same order.

__contains__(key)[source]#
__getitem__(key)[source]#
__init__(data=None, observation_labels=None)[source]#
__iter__()[source]#
__module__ = 'netflow.keepers.keeper'#
add_data(data, label)[source]#

Add a symmetric distance matrix to the keeper.

Parameters:
  • data ({numpy.ndarray, pandas.DataFrame}) – The distance matrix of size (num_observations, num_observations).

  • label (str) – Reference label describing the input.

add_stacked_data(data, label, diag=0.0)[source]#

Add a symmetric distance from stacked Series to the keeper.

Parameters:
  • data (pandas.Series) – The stacked distances of size (num_observations * (num_observations - 1) / 2,) with a 2-multi-index of the pairwise observation labels.

  • label (str) – Reference label describing the input.

  • diag (float) – Value used on diagonal.
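Example (a hedged sketch of the expected stacked format; assumes an empty DistanceKeeper initialized with only observation_labels is permitted, and the label 'geodesic' is illustrative):

>>> import pandas as pd
>>> from netflow.keepers.keeper import DistanceKeeper
>>> pairs = pd.MultiIndex.from_tuples(
...     [("s1", "s2"), ("s1", "s3"), ("s2", "s3")])
>>> stacked = pd.Series([0.4, 0.9, 0.7], index=pairs)  # n*(n-1)/2 pairwise values
>>> dk = DistanceKeeper(observation_labels=["s1", "s2", "s3"])
>>> dk.add_stacked_data(stacked, label="geodesic", diag=0.0)
>>> geo = dk["geodesic"]   # retrieve the stored distance by its label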

property data#

A dictionary of all distances.

items()[source]#
keys()[source]#
property num_observations#

Number of observations.

observation_index(observation_label)[source]#

Return index of observation.

Parameters:

observation_label (str) – The observation label.

Returns:

observation_index – Index of observation in list of observation labels.

Return type:

int

property observation_labels#

Labels for each observation.

subset(observations)[source]#

Return a new instance of DistanceKeeper restricted to subset of observations

Parameters:

observations (list,) –

List of observations to include in the subset. This is treated differently depending on the type of observation labels :

  • If self._observation_labels is List [str], observations can be of the form:

    • List [str], to reference observations by their str label or;

    • List [int], to reference observations by their location index.

  • If self._observation_labels is None or List [int], observations must be of the form List [int], to reference observations by their location index.

Returns:

distance_subset – A DistanceKeeper object restricted to the selected observations.

Return type:

DistanceKeeper

class netflow.keepers.keeper.DistanceView(dkeeper=None, label=None)[source]#

A class to extract a single distance matrix view from a DistanceKeeper for analysis.

This class may also be used to extract a similarity matrix. Distance may be interchanged with similarity, but distance is used for simplicity.

Parameters:
  • dkeeper (DistanceKeeper) – The object to extract the distance from.

  • label (str) – The identifier of the distance matrix to be extracted from dkeeper.

__init__(dkeeper=None, label=None)[source]#
__module__ = 'netflow.keepers.keeper'#
property data#

The distance matrix.

density()[source]#

Return density of each observation

The density of an observation is its net distance to all other observations. This should be minimized for distances and maximized for similarities.

Returns:

d – The densities indexed by the observations.

Return type:

pandas.Series

property label#

The distance label.

property num_observations#

Number of observations.

observation_index(observation_label)[source]#

Return index of observation.

Parameters:

observation_label (str) – The observation label.

Returns:

observation_index – Index of observation in list of observation labels.

Return type:

int

property observation_labels#

Labels for each observation.

subset(observations_a, observations_b=None)[source]#

Return subset of distances between observations_a and observations_b.

Parameters:
  • observations_a (list) –

    Subset of observations to extract distances from, that make up the rows of the returned sub-distance matrix. This is treated differently depending on the type of observation labels :

    • If self._observation_labels is List [str], observations_a can be of the form:

      • List [str], to reference observations by their str label or;

      • List [int], to reference observations by their location index.

• If self._observation_labels is None or List [int], observations_a must be of the form List [int], to reference observations by their location index.

• observations_b ({None, list}, optional) – Subset of observations to pair with observations_a when extracting distances; these make up the columns of the returned sub-distance matrix. The list type depends on the type of self._observation_labels, analogous to observations_a above. If None, observations_a is used to create a symmetric matrix.

Returns:

distance – The sub-matrix of distances where the \(ij^{th}\) entry is the distance between the \(i^{th}\) observation in observations_a and the \(j^{th}\) observation in observations_b.

Return type:

pandas.DataFrame, (len(observations_a), len(observations_b))

to_frame()[source]#

Return data as a pandas DataFrame

class netflow.keepers.keeper.GraphKeeper(graphs=None)[source]#

A class to store and handle multiple netowrks.

Parameters:

graphs ({networkx.Graph, dict [str, networkx.Graph]}) –

One or multiple networks.

The network(s) may be provided in multiple ways:

  • networkx.Graph : A single network.

    • The network is placed in a dict with default label of the form {'graph' : graphs}.

    • To use a customized label for the network, provide the network as a dict, shown next.

• dict [str, networkx.Graph] : A single or multiple networks may be provided as the value(s) of a dict keyed by a str that serves as the network descriptor or label for each input of the form {'graph_label' : `networkx.Graph`}.

Notes

The graph label is also set to the graph’s name.
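Example (a minimal sketch using the inputs documented above; graph contents and labels are illustrative):

>>> import networkx as nx
>>> from netflow.keepers.keeper import GraphKeeper
>>> g = nx.path_graph(["a", "b", "c"])
>>> gk = GraphKeeper(graphs={"pathway": g})        # customized label via a dict
>>> gk.add_graph(nx.cycle_graph(4), label="cycle")
>>> has_cycle = "cycle" in gk                       # membership by reference label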

__contains__(key)[source]#
__getitem__(key)[source]#
__init__(graphs=None)[source]#
__iter__()[source]#
__module__ = 'netflow.keepers.keeper'#
add_graph(graph, label)[source]#

Add a network to the keeper.

Parameters:
  • graph (networkx.Graph) – The network.

  • label (str) – Reference label describing the network.

property graphs#

A dictionary of all the graphs.

items()[source]#
keys()[source]#
class netflow.keepers.keeper.Keeper(data=None, distances=None, similarities=None, graphs=None, misc=None, observation_labels=None, outdir=None, verbose=None)[source]#

A class to store and handle data, distances and similarities between data points (or observations), and miscellaneous related results.

Parameters:
  • data ({numpy.ndarray, pandas.DataFrame, dict [str, numpy.ndarray], dict [str, pandas.DataFrame]}) –

    One or multiple feature data sets, where each data set is size (num_features, num_observations).

    Feature data set(s) may be provided in multiple ways:

    • numpy.ndarray : A single feature data set.

      • Observation labels may be specified in observation_labels.

      • To include feature labels, data should be a pandas.DataFrame.

      • The array data is placed in a dict with default label of the form {'data' : data}.

    • pandas.DataFrame : A single feature data set.

      • The index indicates feature labels.

      • Columns indicate observation labels.

• If observation_labels is provided, it should match the columns of the dataframe, which will be ordered according to observation_labels.

      • The array data.values is placed in a dict with default label of the form {'data' : data.values}.

• dict [str, numpy.ndarray] : A single or multiple feature data set(s) may be provided as the value(s) of a dict keyed by a str that serves as the feature data descriptor or label for each input of the form {'data_label' : `numpy.ndarray`}.

      • All arrays are expected to have the same number of columns corresponding to the number of observations with the same ordering.

      • Observation labels may be specified in observation_labels.

• To include feature labels, pass a dict of pandas.DataFrames instead.

• dict [str, pandas.DataFrame] : A single or multiple feature data set(s) may be provided as the value(s) of a dict keyed by a str that serves as the feature data descriptor or label for each input of the form {'data_label' : `pandas.DataFrame`}.

      • All dataframes are expected to have the same number of columns corresponding to the number of observations with the same columns provided in the same order.

      • The index of each input indicates feature labels.

      • Columns indicate observation labels.

• If observation_labels is provided, it should match the columns of the dataframe(s), which will be ordered according to observation_labels.

  • distances ({numpy.ndarray, pandas.DataFrame, dict [str, numpy.ndarray], dict [str, pandas.DataFrame]}) –

    One or multiple symmetric distance(s) between observations, where each distance matrix is size (num_observations, num_observations).

    Distance(s) may be provided in multiple ways:

    • numpy.ndarray : A single distance matrix.

      • Observation labels may be specified in observation_labels.

• To include observation labels in the matrix itself, distances should be a pandas.DataFrame.

      • The array distances is placed in a dict with default label of the form {'distance' : distances}.

    • pandas.DataFrame : A single distance matrix.

      • The index should be the same as the columns, which indicate observation labels.

• If observation_labels is provided, it should match the rows and columns of the dataframe, which will be ordered according to observation_labels.

      • The array distances.values is placed in a dict with default label of the form {'distance' : distances.values}.

• dict [str, numpy.ndarray] : A single or multiple distance(s) may be provided as the value(s) of a dict keyed by a str that serves as the distance descriptor or label for each input of the form {'distance_label' : `numpy.ndarray`}.

      • All arrays are expected to have the same number of columns and rows corresponding to the number of observations with the same ordering.

      • Observation labels may be specified in observation_labels.

• dict [str, pandas.DataFrame] : A single or multiple distance(s) may be provided as the value(s) of a dict keyed by a str that serves as the distance descriptor or label for each input of the form {'distance_label' : `pandas.DataFrame`}.

      • All dataframes are expected to have the same number of rows and columns corresponding to the number of observations with the same columns provided in the same order.

      • The index and columns indicate observation labels.

• If observation_labels is provided, it should match the columns of the dataframe(s), which will be ordered according to observation_labels.

• similarities ({numpy.ndarray, pandas.DataFrame, dict [str, numpy.ndarray], dict [str, pandas.DataFrame]}) – One or multiple symmetric similarity matrices between observations, where each similarity matrix is size (num_observations, num_observations). Similarities may be provided in multiple ways, analogous to distances.

• graphs ({networkx.Graph, dict [str, networkx.Graph]}) –

    One or multiple networks.

    The network(s) may be provided in multiple ways:

    • networkx.Graph : A single network.

      • The network is placed in a dict with default label of the form {'graph' : graphs}.

      • To use a customized label for the network, provide the network as a dict, shown next.

• dict [str, networkx.Graph] : A single or multiple networks may be provided as the value(s) of a dict keyed by a str that serves as the network descriptor or label for each input of the form {'graph_label' : `networkx.Graph`}.

  • misc (dict) – Miscellaneous data or results.

  • observation_labels (list [str], optional) –

    List of labels for each observation with length equal to num_observations.

    Labels will be set depending on the input format accordingly :

    • If observation_labels is provided and data, distances, or similarities is a :

      • pandas.DataFrame or dict [str, pandas.DataFrame], then observation_labels should match the columns of the dataframe(s), which will be ordered according to observation_labels.

      • numpy.ndarray or dict [str, numpy.ndarray], then the array(s) is (are) assumed to be ordered according to observation_labels.

    • If observation_labels is not provided and data, distances, or similarities is a :

      • pandas.DataFrame or dict [str, pandas.DataFrame], then all dataframes are expected to have the same column names, which is used as the observation_labels.

      • numpy.ndarray or dict [str, numpy.ndarray], then the array(s) is (are) assumed to have columns corresponding to the observations in the same order. Default values are used for observation_labels of the form ‘X0’, ‘X1’, … and so on.

• outdir ({None, str, pathlib.Path}) – Global path where any results will be saved. If None, no results will be saved.

Notes

All data sets, distances and similarities are assumed to contain the same set of data points in the same order.

Subsets of a data set, distance or similarity should be stored in Keeper.misc as a pandas.DataFrame to keep track of the subset of observations (and features).
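Example (a minimal sketch based only on the constructor documented above; array shapes and labels are illustrative):

>>> import numpy as np
>>> import pandas as pd
>>> from netflow.keepers.keeper import Keeper
>>> data = pd.DataFrame(np.random.rand(5, 4),
...                     index=[f"f{i}" for i in range(5)],
...                     columns=[f"obs{j}" for j in range(4)])
>>> kp = Keeper(data={"expression": data}, outdir=None)  # outdir=None: nothing written to disk
>>> obs = kp.observation_labels
>>> n_obs = kp.num_observations   # 4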

PCA(key, n_components=None, random_state=None)[source]#

Principal component analysis (PCA) decomposition.

Parameters:
  • key (str) – The reference key of the data in the data-keeper that PCA decomposition will be performed on.

• n_components ({None, int}) – The number of principal components to keep. If None, all principal components are kept.

  • random_state ({None, int}) – Random state used for certain solvers. Pass an int for reproducible results across runs.

Return type:

PCA data with label “{key}_PCA” is added to the data keeper.
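Example (a hedged sketch; the data label 'expression' is illustrative, and Keeper.data is assumed to expose the stored data sets by key as documented for DataKeeper):

>>> import numpy as np
>>> import pandas as pd
>>> from netflow.keepers.keeper import Keeper
>>> expr = pd.DataFrame(np.random.rand(50, 20))     # 50 features x 20 observations
>>> kp = Keeper(data={"expression": expr})
>>> kp.PCA("expression", n_components=5, random_state=0)
>>> pcs = kp.data["expression_PCA"]                 # stored under "{key}_PCA"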

__init__(data=None, distances=None, similarities=None, graphs=None, misc=None, observation_labels=None, outdir=None, verbose=None)[source]#
__module__ = 'netflow.keepers.keeper'#
_check_num_observations()[source]#

Check that num_observations is consistent (or None) across keepers.

_check_observation_labels()[source]#

Check that observation_labels are consistent (or None) across keepers.

add_data(data, label)[source]#

Add a feature data set to the data keeper.

Parameters:
  • data ({numpy.ndarray, pandas.DataFrame}) – The data set of size (num_features, num_observations).

  • label (str) – Reference label describing the data set.

add_distance(data, label)[source]#

Add a distance array to the distance keeper.

Parameters:
  • data ({numpy.ndarray, pandas.DataFrame}) – The distance array of size (num_observations, num_observations).

  • label (str) – Reference label describing the distance.

add_graph(graph, label)[source]#

Add a network to the graph keeper.

Parameters:
  • graph (networkx.Graph) – The network to add.

  • label (str) – Reference label describing the network.

add_misc(data, label)[source]#

Add misc information to be stored.

Parameters:
  • data – The misc information, e.g., a graph.

  • label (str) – Reference label describing the input.

add_similarity(data, label)[source]#

Add a similarity array to the similarity keeper.

Parameters:
  • data ({numpy.ndarray, pandas.DataFrame}) – The similarity array of size (num_observations, num_observations).

  • label (str) – Reference label describing the similarity.

add_stacked_distance(data, label)[source]#

Add a stacked distance array to the distance keeper.

Parameters:
  • data (pandas.Series) – The stacked distances of size (num_observations * (num_observations - 1) / 2,) with a 2-multi-index of the pairwise observation labels.

  • label (str) – Reference label describing the distance.

add_stacked_similarity(data, label, diag=1.0)[source]#

Add a stacked similarity array to the similarity keeper.

Parameters:
  • data (pandas.Series) – The stacked similarities of size (num_observations * (num_observations - 1) / 2,) with a 2-multi-index of the pairwise observation labels.

  • label (str) – Reference label describing the similarity.

  • diag (float) – Value used on diagonal.

compute_dpt_from_augmented_sym_transitions(key, n_comps: int = 0, save_eig=False)[source]#

Compute the diffusion pseudotime metric between observations from the symmetric transitions.

Note

  • \(T\) is the symmetric transition matrix

  • \(M(x,z) = \sum_{i=1}^{n-1} (\lambda_i * (1 - \lambda_i))\psi_i(x)\psi_i^T(z)\)

• \(dpt(x,z) = ||M(x, .) - M(z, .)||^2\)

Parameters:
  • key (str) – Reference ID for the symmetric transitions numpy.ndarray, (n_observations, n_observations) stored in keeper.misc.

  • n_comps (int) – Number of eigenvalues/vectors to be computed; set n_comps = 0 to compute the whole spectrum. Alternatively, if n_comps >= n_observations, the whole spectrum will be computed.

Returns:

dpt (numpy.ndarray, (n_observations, n_observations)) – Pairwise-observation diffusion pseudotime distances are stored in keeper.distances[dpt_key] where dpt_key="dpt_from_{key}". If the full spectrum is not used (i.e., 0 < n_comps < n_observations), then dpt_key="dpt_from_{key}_{n_comps}comps".

compute_dpt_from_similarity(similarity_key, density_normalize: bool = True, n_comps: int = 0, save_eig=False)[source]#

Compute the diffusion pseudotime metric between observations from a similarity

Note

  • This entails computing the augmented symmetric transitions.

  • \(T\) is the symmetric transition matrix

  • \(M(x,z) = \sum_{i=1}^{n-1} (\lambda_i * (1 - \lambda_i))\psi_i(x)\psi_i^T(z)\)

• \(dpt(x,z) = ||M(x, .) - M(z, .)||^2\)

Parameters:
  • similarity_key (str) – Reference key to the numpy.ndarray, (n_observations, n_observations) symmetric similarity measure (with 1s on the diagonal) stored in the similarities in the keeper.

  • density_normalize (bool) – The density rescaling of Coifman and Lafon (2006): Then only the geometry of the data matters, not the sampled density.

  • n_comps (int) – Number of eigenvalues/vectors to be computed, set n_comps = 0 to compute the whole spectrum. Alternatively, if set n_comps >= n_observations, the whole spectrum will be computed.

Returns:

transitions_asym : numpy.ndarray, (n_observations, n_observations)

Asymmetric Transition matrix (with 0s on the diagonal) added to keeper.misc[f"transitions_asym_{similarity_key}"].

transitions_sym : numpy.ndarray, (n_observations, n_observations)

Symmetric Transition matrix (with 0s on the diagonal) added to keeper.misc[f"transitions_sym_{similarity_key}"].

dpt : numpy.ndarray, (n_observations, n_observations)

Pairwise-observation Diffusion pseudotime distances are stored in keeper.distances[dpt_key] where dpt_key="dpt_from_transitions_asym_{similarity_key}". If the full spectrum is not used (i.e., 0 < n_comps < n_observations), then dpt_key="dpt_from_transitions_asym_{similarity_key}_{n_comps}comps".

Return type:

The following are stored in the keeper

compute_multiscale_VNE_transition_from_similarity(similarity_key, tau_max=None, do_save=True)[source]#

Compute the multi-scale transition matrix based on the elbow of the Von Neumann Entropy (VNE)

as described in GSPA and PHATE KrishnaswamyLab/spARC, https://pdfs.semanticscholar.org/16ab/e92b7630d5b84b904bde97dad9b9fbce406c.pdf.

Parameters:
  • similarity_key (str) – Reference key to the numpy.ndarray, (n_observations, n_observations) symmetric similarity measure (with 1s on the diagonal) stored in the similarities in the keeper.

  • tau_max (int) – Max scale tau tested for VNE (default is 100).

  • do_save (bool) – If True, save to keeper.

Returns:

  • P (numpy.ndarray (n_observations, n_observations)) – The symmetric VNE multi-scale transition matrix (with 0s on the diagonals). If do_save is True, P is added to the keeper.misc with the key 'transitions_sym_multiscaleVNE_{similarity_key}'

  • P_asym (numpy.ndarray (n_observations, n_observations)) – The random-walk VNE multi-scale transition matrix (with 0s on the diagonals). If do_save is True, P_asym is added to the keeper.misc with the key 'transitions_multiscaleVNE_{similarity_key}'

compute_rw_transitions_from_similarity(similarity_key)[source]#

Compute the row-stochastic transition matrix and store in keeper.

Parameters:

similarity_key (str) – Reference key to the numpy.ndarray, (n_observations, n_observations) symmetric similarity measure (with 1s on the diagonal) stored in the similarities in the keeper.

Returns:

transitions_rw_{similarity_key} : numpy.ndarray, (n_observations, n_observations)

Asymmetric random walk transition matrix.

Return type:

Adds the following to keeper.misc (with 0s on the diagonals)

compute_sigmas(distance_key, label=None, n_neighbors=None, method='max', return_nn=False)[source]#

Set sigma for each obs as the distance to its k-th neighbor from keeper.

Parameters:
  • distance_key (str) – The label used to reference the distance matrix stored in keeper.distances, of size (n_observations, n_observations).

  • label ({None, str}) –

    If provided, this is appended to the context tag (tag = f"{method}{n_neighbors}nn_{distance_key}"). The key used to store the results defaults to the tag when label is not provided: key = tag. Otherwise, the key is set to: key = tag + "-" + label. The resulting sigmas are stored in keeper.misc['sigmas_' + key].

    If return_nn is True, nearest neighbor indices are stored in keeper.misc['nn_indices_' + key] and nearest neighbor distances are stored in keeper.misc['nn_distances_' + key].

• n_neighbors ({int, None}) – K-th nearest neighbor (or number of nearest neighbors) to use for computing sigmas, n_neighbors > 0. (Uses n_neighbors + 1, since each obs is its own closest neighbor). If None, all neighbors are used.

  • method ({'mean', 'median', 'max'}) –

    Indicate how to compute sigma.

    Options:

• 'mean' : mean of distance to n_neighbors nearest neighbors

• 'median' : median of distance to n_neighbors nearest neighbors

• 'max' : distance to n_neighbors-nearest neighbor

  • return_nn (bool) – If True, also store indices and distances of n_neighbors nearest neighbors.

Returns:

  • sigmas (numpy.ndarray, (n_observations, )) – The distance to the k-th nearest neighbor for all rows in d. Sigmas represent the kernel width representing each data point’s accessible neighbors. Written to keeper.misc['sigmas_' + key].

  • indices (numpy.ndarray, (n_observations, )) – Indices of nearest neighbors where each row corresponds to an observation. Written, if return_nn is True, to keeper.misc['nn_indices_' + key].

  • distances (numpy.ndarray, (n_observations, n_neighbors + 1)) – Distances to nearest neighbors where each row corresponds to an obs. Written, if return_nn is True, to keeper.misc['nn_distances_' + key].

compute_similarity_from_distance(distance_key, n_neighbors, method, precomputed_method=None, label=None, knn=False)[source]#

Convert distance matrix to symmetric similarity measure.

The resulting similarity is written to the similarity keeper.

Parameters:
  • distance_key (str) – The label used to reference the distance matrix stored in keeper.distances, of size (n_observations, n_observations).

• n_neighbors ({int, None}) – K-th nearest neighbor (or number of nearest neighbors) to use for computing sigmas, n_neighbors > 0. (Uses n_neighbors + 1, since each obs is its own closest neighbor). If None, all neighbors are used.

  • method ({float, ‘mean’, ‘median’, ‘max’, ‘precomputed’}) –

    Indicate how to compute sigma.

    Options:

    • float : constant float to use as sigma

    • int : constant int to use as sigma

• 'mean' : mean of distance to n_neighbors nearest neighbors

• 'median' : median of distance to n_neighbors nearest neighbors

• 'max' : distance to n_neighbors-nearest neighbor

• 'precomputed' : precomputed values extracted from keeper.misc[f"sigmas_{key}"] as a numpy.ndarray of size (n_observations, ).

  • precomputed_method ({'mean', 'median', 'max'}) – This is ignored if method is not ‘precomputed’. When method is ‘precomputed’, specify the method that was previously used for computing sigmas. See method for description of options.

• label ({None, str}) – If provided, this is appended to the context tag (tag = f"{method}{n_neighbors}nn_{distance_key}"). The key used to store the resulting similarity matrix of size (n_observations, n_observations) in keeper.similarities[f"similarity_{key}"] defaults to the tag when label is not provided: key = tag. Otherwise, the key is set to: key = tag + "-" + label.

  • knn (bool) – If True, restrict similarity measure to be non-zero only between n_neighbors nearest neighbors.

Returns:

K – Symmetric similarity measure. Written to keeper.similarities[key].

Return type:

numpy.ndarray, (n_observations, n_observations)
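Example (a hedged end-to-end sketch: a Euclidean distance is computed outside the keeper and then converted to a similarity; the distance label 'euclidean' and the parameter values are illustrative):

>>> import numpy as np
>>> from scipy.spatial.distance import pdist, squareform
>>> from netflow.keepers.keeper import Keeper
>>> X = np.random.rand(6, 10)                      # 6 features x 10 observations
>>> D = squareform(pdist(X.T))                     # (10, 10) pairwise observation distances
>>> kp = Keeper(data=X, distances={"euclidean": D})
>>> K = kp.compute_similarity_from_distance("euclidean", n_neighbors=3, method="max")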

compute_transitions_from_similarity(similarity_key, density_normalize: bool = True)[source]#

Compute symmetric and asymmetric transition matrices and store in keeper.

Note

Code primarily copied from scanpy.neighbors.

Parameters:
  • similarity_key (str) – Reference key to the numpy.ndarray, (n_observations, n_observations) symmetric similarity measure (with 1s on the diagonal) stored in the similarities in the keeper.

  • density_normalize (bool) – The density rescaling of Coifman and Lafon (2006): Then only the geometry of the data matters, not the sampled density.

Returns:

transitions_asym_{similarity_key} : numpy.ndarray, (n_observations, n_observations)

Asymmetric Transition matrix.

transitions_sym_{similarity_key} : numpy.ndarray, (n_observations, n_observations)

Symmetric Transition matrix.

Return type:

Adds the following to keeper.misc (with 0s on the diagonals)

construct_pose(key, root=None, root_as_tip=False, min_branch_size=5, choose_largest_segment=False, flavor='haghverdi16', allow_kendall_tau_shift=False, smooth_corr=True, brute=True, split=True, verbose=None, n_branches=2, until_branched=False, mutual=False, k_mnn=3, connect_closest=False, connect_trunk='classic', annotate=True)[source]#

Construct the POSE from specified distance.

Parameters:
  • key (str) – The label used to reference the distance matrix stored in keeper.distances, of size (n_observations, n_observations).

  • root ({None, int, ‘density’, ‘density_inv’, ‘ratio’}) –

    The root. If None, ‘density’ is used.

    Options:

    • int : index of observation

• 'density' : select observation with minimal distance-density

• 'density_inv' : select observation with maximal distance-density

• 'ratio' : select observation which leads to maximal triangular ratio distance

  • root_as_tip (bool) – If True, force first tip as the root. Defaults to False following scanpy implementation.

  • min_branch_size ({int, float}) – During recursive splitting of branches, only consider splitting a branch with at least min_branch_size > 2 data points. If a float, min_branch_size refers to the fraction of the total number of data points (0 < min_branch_size < 1).

  • choose_largest_segment (bool) – If True, select largest segment for branching.

  • flavor ({'haghverdi16', 'wolf17_tri', 'wolf17_bi', 'wolf17_bi_un'}) – Branching algorithm (based on scanpy implementation).

  • allow_kendall_tau_shift (bool) – If a very small branch is detected upon splitting, shift away from maximum correlation in Kendall tau criterion of [Haghverdi16] to stabilize the splitting.

• smooth_corr (bool, default = True) – If True, smooth correlations before identifying cut points for branch splitting.

  • brute (bool) – If True, data points not associated with any branch upon split are combined with undecided (trunk) points. Otherwise, if False, they are treated as individual islands, not associated with any branch (and assigned branch index -1).

  • split (bool (default = True)) – if True, split segment into multiple branches. Otherwise, determine a single branching off of the main segment. This is ignored if flavor is not ‘haghverdi16’. If True, brute is ignored.

  • n_branches (int) – Number of branches to look for (n_branches > 0).

• until_branched (bool) – If True, iteratively find a segment to branch and perform branching until a segment is successfully branched or no branchable segments remain. Otherwise, if False, attempt to perform branching only once on the next potentially branchable segment. Note: This is only applicable when branching is being performed. If previous iterations of branching have already been performed, it is not possible to identify the number of iterations where no branching was performed.

  • mutual (bool (default = False)) – If True, add k_mnn mutual nn edges. Otherwise, add single nn edge. When False, k_mnn is ignored.

  • k_mnn (int (0 < k_mnn < len(G))) – The number of nns to consider when extracting mutual nns. Note, this is ignored when mutual is False.

  • connect_closest (bool (default = False)) – If True, connect branches by points with smallest distance between the branches. Otherwise, connect by continuum of ordering.

  • connect_trunk ({'classic', 'endpoint', 'dual'}, default = 'classic') –

Specify how to connect segments to the unresolved/unidentified trunk. Note, this only applies when a split results in a trunk consisting of unresolved/unidentified points. Additionally, this is ignored if flavor is not 'haghverdi16'. It is also ignored if flavor is 'haghverdi16' and split is False.

    Options:

    • classic : point identified in trunk is connected to the point in the segment closest to it

• endpoint : point identified in trunk is connected to the segment’s second tip

    • dual : point identified in trunk is connected to both points determined by classic and endpoint

  • annotate (bool) – If True, annotate edges and nodes with POSE features.

Returns:

  • poser (netflow.pose.POSER) – The object used to construct the POSE.

  • G_poser_nn (networkx.Graph) – The updated graph with nearest neighbor edges. If annotate is True, edge attribute “edge_origin” is added with the possible values :

• "POSE" : for edges in the original graph that are not nearest neighbor edges

• "NN" : for nearest neighbor edges that were not in the original graph

• "POSE + NN" : for edges in the original graph that are also nearest neighbor edges
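Example (a hedged call; assumes a Keeper kp already holding a distance labeled 'euclidean', e.g., as in the sketch under compute_similarity_from_distance above):

>>> poser, G_pose = kp.construct_pose("euclidean", root="density",
...                                   n_branches=2, annotate=True)
>>> origins = list(G_pose.edges(data="edge_origin"))  # 'POSE', 'NN', or 'POSE + NN'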

convert_similarity_to_distance(similarity_key)[source]#

Convert a similarity to a distance.

The distance, computed as 1-similarity, is added to the distance keeper with the key "distance_from_" + similarity_key

Parameters:

similarity_key (str) – The similarity reference key.

Returns:

d – The new distance is saved to the keeper in keeper.distances[f"distance_from_{similarity_key}"].

Return type:

The following are saved to the distance keeper
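Example (a hedged sketch; kp is a Keeper and 'my_similarity' is a hypothetical reference key for a similarity already stored in it):

>>> # kp holds a similarity under the hypothetical key 'my_similarity'
>>> kp.convert_similarity_to_distance("my_similarity")
>>> d = kp.distances["distance_from_my_similarity"]   # key documented above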

property data#

The feature data sets.

distance_density(label)[source]#

Compute each observation’s density from a distance.

The density of an observation is its net distance to all other observations.

Parameters:

label (str) – The reference label for the distance.

Returns:

density – The densities indexed by the observation labels.

Return type:

pandas.Series

distance_density_argmax(label)[source]#

Find the observation with the largest density from a distance.

The density of an observation is its net distance to all other observations.

Parameters:

label (str) – The reference label for the distance.

Returns:

obs – The index of the observation with the largest density.

Return type:

int

distance_density_argmin(label)[source]#

Find the observation with the smallest density from a distance.

The density of an observation is its net distance to all other observations.

Parameters:

label (str) – The reference label for the distance.

Returns:

obs – The index of the observation with the smallest density.

Return type:

int

property distances#

The distances.

euc_distance_pairwise_observation_feature_nbhd(data_key, graph_key, features=None, include_self=False, label=None, metric='euclidean', normalize=False, **kwargs)[source]#

Compute Euclidean distances between feature neighborhoods of every two observations

Note

If object.outdir is not None, Euclidean distances are saved to file every 10 iterations. Before starting the computation, check if the file exists. If so, load and remove already computed nodes from the iteration. Euclidean distances are computed for the remaining nodes, combined with the previously computed and saved results before saving and returning the combined results.

Only nodes with at least 2 neighbors are included, as leaf nodes will all have the same Euclidean distance and do not provide any further information.

The resulting observation-pairwise Euclidean distances are saved to misc (aka self.misc) and can be accessed by self.misc[f"{data_key}_{label}_nbhd_euc_with{'' if include_self else 'out'}_self"].

Parameters:
  • data_key (str) – The key to the data in the data keeper that should be used.

• graph_key (str) – The key to the graph in the graph keeper that should be used. (Does not have to include all features in the data)

• features ({None, list [str]}) – List of features (nodes) to compute neighborhood distances on. If None, all features are used.

  • include_self (bool) – If True, add node in neighborhood which will result in computing normalized profile over the neighborhood. If False, node is not included in neighborhood which results in computing the transition distribution over the neighborhood.

• label (str) – Label under which the resulting Euclidean distances are saved in keeper.misc and the name of the file used to store stacked results.

  • metric (str) – The metric used to compute the distance, passed to scipy.spatial.distance.cdist.

  • normalize (bool) – If True, normalize neighborhood profiles to sum to 1.

  • **kwargs (dict) – Extra arguments to metric, passed to scipy.spatial.distance.cdist.

Returns:

eds – Euclidean distances between pairwise observations where rows are observation-pairs and columns are feature (node) names, saved in keeper.misc with the key f"{data_key}_{label}_nbhd_euc_with{'' if include_self else 'out'}_self".

Return type:

pandas.DataFrame

euc_distance_pairwise_observation_profile(data_key, features=None, label=None, metric='euclidean', normalize=False, **kwargs)[source]#

Compute Euclidean distances between feature profiles of every two observations

Note

If object.outdir is not None, Euclidean distances are saved to file. Before starting the computation, check if the file exists. If so, load and remove already computed nodes from the iteration. Euclidean distances are computed for the remaining nodes, combined with the previously computed and saved results before saving and returning the combined results.

Only nodes with at least 2 neighbors are included, as leaf nodes will all have the same Euclidean distance and do not provide any further information.

The resulting observation-pairwise Euclidean distances are saved to the DistanceKeeper (aka self.distances) and can be accessed by self.distances[f'{data_key}_{label}_profile_euc'].

Parameters:
  • data_key (str) – The key to the data in the data keeper that should be used.

• features ({None, list [str]}) – List of features to compute profile distances on. If None, all features are used.

• label (str) – Label under which the resulting Euclidean distances are saved in keeper.distances and the name of the file used to store stacked results.

  • metric (str) – The metric used to compute the distance, passed to scipy.spatial.distance.cdist.

  • normalize (bool) – If True, normalize neighborhood profiles to sum to 1.

  • **kwargs (dict) – Extra arguments to metric, passed to scipy.spatial.distance.cdist.

Returns:

eds – Euclidean distances between pairwise profiles where rows are observation-pairs and columns are node names, saved in keeper.distances with the key f'{data_key}_{label}_profile_euc'.

Return type:

pandas.DataFrame

fuse_similarities(similarity_keys, weights=None, fused_key=None)[source]#

Fuse similarities in the keeper

Parameters:
• similarity_keys (list) – Reference keys of similarities to fuse.

  • weights (list) – (Optional) Weight each similarity contributes to the fused similarity. Should be the same length as similarity_keys. If not provided, default behavior is to apply uniform weights.

  • fused_key (str) – (Optional) Specify key used to store the fused similarity in the keeper. Default behavior is to fuse the keys of the original similarities.

Returns:

  • fused_sim : The fused similarity, where the reference key, if not provided, is fused from the original labels.

Return type:

The following is added to the similarity keeper
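Example (a hedged sketch; kp is a Keeper, and the similarity keys, weights, and fused key are hypothetical):

>>> # kp holds similarities under the hypothetical keys 'sim_rna' and 'sim_protein'
>>> kp.fuse_similarities(["sim_rna", "sim_protein"],
...                      weights=[0.7, 0.3],
...                      fused_key="sim_fused")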

property graphs#

The networks.

integrate_multiscale_VNE_transitions_from_similarities(similarity_keys, tau_max=None, integrated_key=None)[source]#

Integrate multi-scale transitions where scale is determined by elbow of Von Neumann Entropy (VNE)

As described in https://pdfs.semanticscholar.org/16ab/e92b7630d5b84b904bde97dad9b9fbce406c.pdf.

Parameters:
  • similarity_keys (list) – Reference keys of similarities to compute transitions and integrate.

  • tau_max (int) – Max scale tau to test Von Neumann Entropy on (default is 100).

  • integrated_key (str) – (Optional) Specify key used to store the integrated transitions into the keeper. Default behavior is to integrate the keys of the original transitions.

Returns:

  • integrated transition : The (symmetric) integrated multi-scale transition, where the reference key, if not provided, is fused from the original labels.

  • integrated asymmetric transition : The (asymmetric random-walk) integrated multi-scale transition, where the reference key, if not provided, is fused from the original labels.

Return type:

The following is added to the similarity keeper

integrate_transitions(transition_keys, integrated_key=None)[source]#

Integrate transitions

Parameters:
  • transition_keys (list) – Reference keys of transitions to integrate.

  • integrated_key (str) – (Optional) Specify key used to store the integrated transitions into the keeper. Default behavior is to integrate the keys of the original transitions.

Returns:

  • integrated transition : The integrated transition, where the reference key, if not provided, is fused from the original labels.

Return type:

The following is added to the similarity keeper

integrate_transitions_from_similarity(similarity_keys, integrated_key=None, density_normalize: bool = True, n_comps: int = 0)[source]#

Compute the integrated transition from similarities.

Note

  • This entails computing the augmented symmetric transitions.

  • \(T\) is the symmetric transition matrix

Parameters:
  • similarity_keys (list) – Reference keys of similarities to compute transitions and integrate.

• integrated_key (str) – (Optional) Specify key used to store the integrated transitions into the keeper. Default behavior is to integrate the keys of the original transitions.

  • density_normalize (bool) – The density rescaling of Coifman and Lafon (2006): Then only the geometry of the data matters, not the sampled density.

  • n_comps (int) – Number of eigenvalues/vectors to be computed, set n_comps = 0 to compute the whole spectrum. Alternatively, if set n_comps >= n_observations, the whole spectrum will be computed.

Returns:

• transitions_asym : numpy.ndarray, (n_observations, n_observations)

  Asymmetric Transition matrix (with 0s on the diagonal) added to keeper.misc[f"transitions_asym_{similarity_key}"].

• transitions_sym : numpy.ndarray, (n_observations, n_observations)

  Symmetric Transition matrix (with 0s on the diagonal) added to keeper.misc[f"transitions_sym_{similarity_key}"].

  • transitions_i : The integrated transition, where the reference key, if not provided, is fused from the original labels.

Return type:

The following are stored in the keeper

load_data(file_name, label='data', file_path=None, file_format=None, delimiter=',', dtype=None, cols_as_obs=True, **kwargs)[source]#

Load data from file into the keeper.

Note

Currently loads data using pandas.read_csv. Additional formats will be added in the future.

Parameters:
  • file_name ({str, pathlib.Path}) – Input data file name.

  • label (str, (default: ‘data’)) – Reference label describing the data set.

  • file_path ({str pathlib.Path}, optional (default: None)) – File path. Empty string by default

  • file_format (str, optional (default: None)) – File format. Currently supported file formats: ‘txt’, ‘csv’, ‘tsv’. If None, file_format will be inferred from the file extension in file_name. Currently, this is ignored.

  • delimiter (str, optional (default: ‘,’)) – Delimiter to use.

• dtype – If provided, the loaded data is converted to this type.

  • cols_as_obs (bool (default = True)) – If True, columns in the loaded data are observations, otherwise, the rows are observations.

  • **kwargs – Additional key-word arguments passed to pandas.read_csv.
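Example (a hedged sketch; the file name and path are illustrative placeholders):

>>> from netflow.keepers.keeper import Keeper
>>> kp = Keeper()
>>> kp.load_data("expression.csv", label="expression",
...              file_path="/path/to/data", delimiter=",",
...              cols_as_obs=True)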

load_distance(file_name, label='distance', file_path=None, file_format=None, delimiter=',', **kwargs)[source]#

Load distance from file into the keeper.

Note

Assumed that the distance array is stored with the first row and first column as the index and header, respectively.

Currently loads data using pandas.read_csv. Additional formats will be added in the future.

Parameters:
  • file_name ({str, pathlib.Path}) – Input distance file name.

  • label (str, (default: ‘distance’)) – Reference label describing the data set.

  • file_path ({str pathlib.Path}, optional (default: None)) – File path. Empty string by default

  • file_format (str, optional (default: None)) – File format. Currently supported file formats: ‘txt’, ‘csv’, ‘tsv’. If None, file_format will be inferred from the file extension in file_name. Currently, this is ignored.

  • delimiter (str, optional (default: ‘,’)) – Delimiter to use.

  • **kwargs – Additional key-word arguments passed to pandas.read_csv.

load_graph(file_name, label='graph', file_path=None, file_format=None, delimiter=',', source='source', target='target', **kwargs)[source]#

Load network (edgelist) from file into graph and store in the keeper.

Note

Currently loads graph from edgelist. Future release will allow different graph types (e.g., adjacency, graphml).

Assumed that the edge-list is stored as two columns, where the first row is labeled as source and target.

Currently loads data using pandas.read_csv. Additional formats will be added in the future.

Parameters:
  • file_name ({str, pathlib.Path}) – Input edge-list file name.

  • label (str, (default: ‘graph’)) – Reference label describing the network.

  • file_path ({str pathlib.Path}, optional (default: None)) – File path. Empty string by default

  • file_format (str, optional (default: None)) – File format. Currently supported file formats: ‘txt’, ‘csv’, ‘tsv’. If None, file_format will be inferred from the file extension in file_name. Currently, this is ignored.

  • delimiter (str, optional (default: ‘,’)) – Delimiter to use.

  • source ({str, int} (default: ‘source’)) – A valid column name (string or integer) for the source nodes passed to networkx.from_pandas_edgelist.

  • target ({str, int} (default: ‘target’)) – A valid column name (string or integer) for the target nodes passed to networkx.from_pandas_edgelist.

  • **kwargs – Additional key-word arguments passed to pandas.read_csv.

load_similarity(file_name, label='similarity', file_path=None, file_format=None, delimiter=',', **kwargs)[source]#

Load similarity from file into the keeper.

Note

Assumed that the similarity array is stored with the first row and first column as the index and header, respectively.

Currently loads data using pandas.read_csv. Additional formats will be added in the future.

Parameters:
  • file_name ({str, pathlib.Path}) – Input similarity file name.

  • label (str, (default: ‘similarity’)) – Reference label describing the data set.

  • file_path ({str pathlib.Path}, optional (default: None)) – File path. Empty string by default

  • file_format (str, optional (default: None)) – File format. Currently supported file formats: ‘txt’, ‘csv’, ‘tsv’. If None, file_format will be inferred from the file extension in file_name. Currently, this is ignored.

  • delimiter (str, optional (default: ‘,’)) – Delimiter to use.

  • **kwargs – Additional key-word arguments passed to pandas.read_csv.

load_stacked_distance(file_name, label='distance', file_path=None, file_format=None, delimiter=',', **kwargs)[source]#

Load distance in stacked form from file, convert to unstacked form and store in the keeper.

Note

Assumed that the stacked distances are stored with a 2-multi-index of the pairwise-observation (excluding self-pairs) and a single column with the pairwise distances.

Currently loads data using pandas.read_csv. Additional formats will be added in the future.

Parameters:
  • file_name ({str, pathlib.Path}) – Input distance file name.

  • label (str, (default: ‘distance’)) – Reference label describing the data set.

  • file_path ({str pathlib.Path}, optional (default: None)) – File path. Empty string by default

  • file_format (str, optional (default: None)) – File format. Currently supported file formats: ‘txt’, ‘csv’, ‘tsv’. If None, file_format will be inferred from the file extension in file_name. Currently, this is ignored.

  • delimiter (str, optional (default: ‘,’)) – Delimiter to use.

  • **kwargs – Additional key-word arguments passed to pandas.read_csv.

load_stacked_similarity(file_name, label='similarity', diag=1.0, file_path=None, file_format=None, delimiter=',', **kwargs)[source]#

Load similarity in stacked form from file, convert to unstacked form and store in the keeper.

Note

It is assumed that the stacked similarities are stored with a two-level MultiIndex of observation pairs (excluding self-pairs) and a single column containing the pairwise similarities.

Currently loads data using pandas.read_csv. Additional formats will be added in the future.

Parameters:
  • file_name ({str, pathlib.Path}) – Input similarity file name.

  • label (str, (default: ‘similarity’)) – Reference label describing the data set.

  • diag (float) – Value placed on the diagonal of the unstacked similarity matrix.

  • file_path ({str, pathlib.Path}, optional (default: None)) – File path. If None, an empty string is used.

  • file_format (str, optional (default: None)) – File format. Currently supported file formats: ‘txt’, ‘csv’, ‘tsv’. If None, file_format will be inferred from the file extension in file_name. Currently, this argument is ignored.

  • delimiter (str, optional (default: ‘,’)) – Delimiter to use.

  • **kwargs – Additional keyword arguments passed to pandas.read_csv.
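
To illustrate the stacked layout that load_stacked_distance and load_stacked_similarity expect, the following sketch converts a two-level MultiIndex of observation pairs (self-pairs excluded) into a symmetric square matrix and fills the diagonal, as the diag parameter does for similarities (for distances the natural diagonal value is 0.0). The file and column layout are illustrative assumptions, not the loaders' exact internals:

  import numpy as np
  import pandas as pd

  # Stacked form: MultiIndex (obs_i, obs_j) over observation pairs, one value column.
  stacked = pd.read_csv("similarity_stacked.csv", delimiter=",", index_col=[0, 1]).squeeze("columns")

  square = stacked.unstack()               # observation pairs -> square matrix (possibly one-sided)
  square = square.combine_first(square.T)  # symmetrize using the transpose
  np.fill_diagonal(square.values, 1.0)     # self-similarity on the diagonal; use 0.0 for distances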

log1p(key, base=None)[source]#

Logarithmic data transformation.

Computes \(data = \log(data + 1)\) with the natural logarithm as the default base.

Parameters:
  • key (str) – The reference key of the data in the data-keeper that will be logarithmically transformed.

  • base ({None, int}) – Base used for the logarithmic transformation. If None, the natural logarithm is used.

Returns:

Logarithmically transformed data with label “{key}_log1p” is added to the data keeper.
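
The transformation itself reduces to a log1p plus an optional change of base, since \(\log_b(x + 1) = \ln(x + 1) / \ln(b)\). A minimal numerical sketch with illustrative values:

  import numpy as np

  x = np.array([0.0, 1.0, 9.0])
  natural = np.log1p(x)             # default: natural logarithm
  base2 = np.log1p(x) / np.log(2)   # change of base: log_2(x + 1) = ln(x + 1) / ln(2)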

property misc#

The misc data.

property num_observations#

The number of observations.

observation_index(observation_label)[source]#

Return index of observation.

Parameters:

observation_label (str) – The observation label.

Returns:

observation_index – Index of observation in list of observation labels.

Return type:

int

property observation_labels#

Labels for each observation.

save_data(label, file_format='txt', delimiter=',', **kwargs)[source]#

Save data to file.

Note

This currently only saves a pandas DataFrame to .txt, .csv, or .tsv. Future releases will allow for other formats.

Data set is saved to the file named ‘{self.outdir}/data_{label}.{file_format}’.

Parameters:
  • label (str) – Reference label describing which data set to save.

  • file_format (str) – File format. Currently supported file formats: ‘txt’, ‘csv’, ‘tsv’.

  • delimiter (str, optional (default: ‘,’)) – Delimiter to use.

  • **kwargs – Additional keyword arguments passed to pandas.DataFrame.to_csv.
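
Given the description above, saving amounts to writing the stored DataFrame to ‘{self.outdir}/data_{label}.{file_format}’ with the chosen delimiter. A minimal sketch of the equivalent call; the directory, label, and toy DataFrame are illustrative, not the keeper's internals:

  from pathlib import Path
  import pandas as pd

  outdir, label, file_format = Path("results"), "expression", "csv"   # illustrative values
  df = pd.DataFrame({"obs1": [1.0, 2.0], "obs2": [3.0, 4.0]}, index=["featA", "featB"])

  # Equivalent of save_data(label, file_format="csv", delimiter=","):
  outdir.mkdir(parents=True, exist_ok=True)
  df.to_csv(outdir / f"data_{label}.{file_format}", sep=",")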

save_distance(label, file_format='txt', delimiter=',', **kwargs)[source]#

Save distance to file.

Note

This currently only saves a pandas DataFrame to .txt, .csv, or .tsv. Future releases will allow for other formats.

Distance is saved to the file named ‘{self.outdir}/distance_{label}.{file_format}’.

Parameters:
  • label (str) – Reference label describing which distance to save.

  • file_format (str) – File format. Currently supported file formats: ‘txt’, ‘csv’, ‘tsv’.

  • delimiter (str, optional (default: ‘,’)) – Delimiter to use.

  • **kwargs – Additional keyword arguments passed to pandas.DataFrame.to_csv.

save_misc(label, file_format='txt', delimiter=',', **kwargs)[source]#

Save misc data to file.

Note

This currently only saves a pandas DataFrame to .txt, .csv, or .tsv. Future releases will allow for other formats.

Misc data is saved to the file named ‘{self.outdir}/misc_{label}.{file_format}’.

Parameters:
  • label (str) – Reference label describing which misc data to save.

  • file_format (str) – File format. Currently supported file formats: ‘txt’, ‘csv’, ‘tsv’.

  • delimiter (str, optional (default: ‘,’)) – Delimiter to use.

  • **kwargs – Additional keyword arguments passed to pandas.DataFrame.to_csv.

save_similarity(label, file_format='txt', delimiter=',', **kwargs)[source]#

Save similarity to file.

Note

This currently only saves a pandas DataFrame to .txt, .csv, or .tsv. Future releases will allow for other formats.

Similarity is saved to the file named ‘{self.outdir}/similarity_{label}.{file_format}’.

Parameters:
  • label (str) – Reference label describing which similarity to save.

  • file_format (str) – File format. Currently supported file formats: ‘txt’, ‘csv’, ‘tsv’.

  • delimiter (str, optional (default: ‘,’)) – Delimiter to use.

  • **kwargs – Additional keyword arguments passed to pandas.DataFrame.to_csv.

property similarities#

The similarities.

similarity_density(label)[source]#

Compute each observation’s density from a similarity.

The density of an observation is its net similarity to all other observations.

Parameters:

label (str) – The reference label for the similarity.

Returns:

density – The densities indexed by the observation labels.

Return type:

pandas.Series

similarity_density_argmax(label)[source]#

Find the observation with the largest density from a similarity.

The density of an observation is its net similarity to all other observations.

Parameters:

label (str) – The reference label for the similarity.

Returns:

obs – The label of the observation with the largest density.

Return type:

str
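
Both quantities follow directly from a square similarity matrix: the density of an observation is the sum of its similarities to the other observations, and the argmax is the observation label with the largest such sum. A sketch with an illustrative toy matrix; whether the self-similarity is excluded is an assumption of this sketch:

  import numpy as np
  import pandas as pd

  obs = ["a", "b", "c"]
  sim = pd.DataFrame([[1.0, 0.8, 0.1],
                      [0.8, 1.0, 0.3],
                      [0.1, 0.3, 1.0]], index=obs, columns=obs)

  # Net similarity of each observation to the other observations
  # (self-similarity on the diagonal removed in this sketch).
  density = sim.sum(axis=1) - np.diag(sim.values)
  densest = density.idxmax()   # observation label with the largest density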

standardize(key, label=None, **kwargs)[source]#

Standardize features in DataKeeper by removing the mean and scaling to unit variance.

Parameters:
  • key (str) – The reference key of the data in the data-keeper that will be standardized.

  • label ({str, None}) – The label used to store the standardized data. If None, default label f'{key}_z' is used.

  • kwargs – Keyword arguments passed to sklearn.preprocessing.StandardScaler.

Returns:

data_z – The standardized data.

Return type:

pandas.DataFrame
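
For intuition, standardizing the stored (num_features, num_observations) data gives each feature row zero mean and unit variance across observations. A sketch using sklearn.preprocessing.StandardScaler; the toy data and the transpose handling are illustrative assumptions, not the method's exact internals:

  import numpy as np
  import pandas as pd
  from sklearn.preprocessing import StandardScaler

  # Toy data: rows are features, columns are observations.
  data = pd.DataFrame(np.random.default_rng(0).normal(size=(3, 5)),
                      index=["f1", "f2", "f3"],
                      columns=[f"obs{i}" for i in range(5)])

  # StandardScaler standardizes columns, so transpose to scale each feature across observations.
  data_z = pd.DataFrame(StandardScaler().fit_transform(data.T).T,
                        index=data.index, columns=data.columns)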

subset(observations, keep_misc=False, keep_graphs=False, outdir=None)[source]#

Return a new instance of Keeper restricted to subset of observations.

The default behavior is to not include misc or graphs in the Keeper subset. This is because there is no check for which observations the misc and graphs correspond to.

Warning: The subset keeper and all the data it contains are not copies.

Parameters:
  • observations (list) –

    List of observations to include in the subset. This is treated differently depending on the type of observation labels :

    • If self._observation_labels is List [str], observations can be of the form:

      • List [str], to reference observations by their str label or;

      • List [int], to reference observations by their location index.

    • If self._observation_labels is None or List [int], observations must be of the form List [int], to reference observations by their location index.

  • keep_misc (bool) – If True, misc is added to the new Keeper.

  • keep_graphs (bool) – If True, the graphs are added to the new Keeper.

  • outdir ({None, str, pathlib.Path}) – Global path where any results will be saved. If None, no results will be saved.

Returns:

keeper_subset – A Keeper object restricted to the selected observations.

Return type:

Keeper
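
Usage follows the labeling rules above. A sketch assuming a Keeper instance named keeper whose observation labels are strings; the labels themselves are illustrative:

  # Subset by string labels (valid when observation labels are strings) ...
  keeper_ab = keeper.subset(["obs_a", "obs_b"], keep_misc=False, keep_graphs=False)

  # ... or by location index, which works regardless of the label type.
  keeper_first_two = keeper.subset([0, 1])

  # The subset shares its underlying data with the original keeper; it is not a copy.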

wass_distance_pairwise_observation_feature_nbhd(data_key, graph_key, features=None, include_self=False, label=None, graph_distances=None, edge_weight=None, proc=4, chunksize=None, measure_cutoff=1e-06, solvr=None)[source]#

Compute Wasserstein distances between feature neighborhoods of every two observations

Note

If object.outdir is not None, Wasserstein distances are saved to file every 10 iterations. Before starting the computation, the file is checked; if it exists, the already-computed nodes are loaded and removed from the iteration. Wasserstein distances are then computed for the remaining nodes and combined with the previously saved results before the combined results are saved and returned.

Only nodes with at least 2 neighbors are included, as leaf nodes will all have the same Wasserstein distance and do not provide any further information.

The resulting observation-pairwise Wasserstein distances are saved to misc (aka self.misc) and can be accessed by self.misc[f"{data_key}_{label}_nbhd_wass_with{'' if include_self else 'out'}_self"].

Parameters:
  • data_key (str) – The key to the data in the data keeper that should be used.

  • graph_key (str) – The key to the graph in the graph keeper that should be used. (The graph does not have to include all features in the data.)

  • features ({None, list [str]}) – List of features (nodes) to compute neighborhood distances on. If None, all features are used.

  • include_self (bool) – If True, add node in neighborhood which will result in computing normalized profile over the neighborhood. If False, node is not included in neighborhood which results in computing the transition distribution over the neighborhood.

  • label (str) – Label under which the resulting Wasserstein distances are saved in keeper.misc, also used in the name of the file storing the stacked results.

  • graph_distances (numpy.ndarray, (n, n)) – A matrix of node-pairwise graph distances between the \(n\) nodes (ordered from \(0, 1, ..., n-1\)). If None, use hop distance.

  • edge_weight ({None, str}) – The edge attribute used as the weight for computing the graph distances. This is ignored if graph_distances is provided. If None, no edge weight is used.

  • measure_cutoff (float) – Threshold for treating values in profiles as zero, default = 1e-6.

  • proc (int) – Number of processors used for multiprocessing. (Default value = cpu_count()).

  • chunksize (int) – Chunksize to allocate for multiprocessing.

  • solvr (str) – Solver to pass to POT library for computing Wasserstein distance.

Returns:

wds – Wasserstein distances between pairwise observations where rows are observation-pairs and columns are feature (node) names, saved in keeper.misc with the key f"{data_key}_{label}_nbhd_wass_with{'' if include_self else 'out'}_self".

Return type:

pandas.DataFrame
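
The core computation for each observation pair can be sketched with the POT library mentioned above: two probability measures supported on the nodes of a feature neighborhood, and a ground-cost matrix of graph distances between those nodes. The neighborhood size, profiles, and distances below are illustrative assumptions, not the method's internals:

  import numpy as np
  import ot  # POT: Python Optimal Transport

  # Graph distances between the 3 features of an (illustrative) neighborhood.
  M = np.array([[0.0, 1.0, 2.0],
                [1.0, 0.0, 1.0],
                [2.0, 1.0, 0.0]])

  # Normalized feature profiles of two observations over the neighborhood.
  a = np.array([0.7, 0.2, 0.1])
  b = np.array([0.1, 0.3, 0.6])

  wass = ot.emd2(a, b, M)   # exact Wasserstein distance under the ground cost M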

wass_distance_pairwise_observation_profile(data_key, graph_key, features=None, label=None, graph_distances=None, edge_weight=None, proc=4, chunksize=None, measure_cutoff=1e-06, solvr=None)[source]#

Compute Wasserstein distances between feature profiles of every pair of observations.

Note

If object.outdir is not None, Wasserstein distances are saved to file every 10 iterations. Before starting the computation, the file is checked; if it exists, the already-computed nodes are loaded and removed from the iteration. Wasserstein distances are then computed for the remaining nodes and combined with the previously saved results before the combined results are saved and returned.

Only nodes with at least 2 neighbors are included, as leaf nodes will all have the same Wasserstein distance and do not provide any further information.

The resulting observation-pairwise Wasserstein distances are saved to the DistanceKeeper (aka self.distances) and can be accessed by self.distances[f'{data_key}_{label}_wass_dist_observation_pairwise_profiles'].

Parameters:
  • data_key (str) – The key to the data in the data keeper that should be used.

  • graph_key (str) – The key to the graph in the graph keeper that should be used. (The graph does not have to include all features in the data.)

  • features ({None, list [str]}) – List of features to compute profile distances on. If None, all features are used.

  • label (str) – Label under which the resulting Wasserstein distances are saved in keeper.distances, also used in the name of the file storing the stacked results.

  • graph_distances (numpy.ndarray, (n, n)) – A matrix of node-pairwise graph distances between the \(n\) nodes (ordered from \(0, 1, ..., n-1\)). If None, use hop distance.

  • edge_weight ({None, str}) – The edge attribute used as the weight for computing the graph distances. This is ignored if graph_distances is provided. If None, no edge weight is used.

  • measure_cutoff (float) – Threshold for treating values in profiles as zero, default = 1e-6.

  • proc (int) – Number of processors used for multiprocessing. (Default value = cpu_count()).

  • chunksize (int) – Chunksize to allocate for multiprocessing.

  • solvr (str) – Solver to pass to POT library for computing Wasserstein distance.

Returns:

wds – Wasserstein distances between pairwise profiles where rows are observation-pairs and columns are node names, saved in keeper.distances with the key f'{data_key}_{label}_profile_wass'.

Return type:

pandas.DataFrame