netflow.keepers.keeper#
Classes used for data storage and manipulation.
Classes
- DataKeeper : A class to store and handle multiple data sets.
- DataView : A class to extract a single data view from a DataKeeper for analysis.
- DistanceKeeper : A class to store and handle multiple distances.
- DistanceView : A class to extract a single distance matrix view from a DistanceKeeper for analysis.
- GraphKeeper : A class to store and handle multiple networks.
- Keeper : A class to store and handle data, distances and similarities between data points (or observations), and miscellaneous related results.
- class netflow.keepers.keeper.DataKeeper(data=None, observation_labels=None)[source]#
A class to store and handle multiple data sets.
- Parameters:
data ({numpy.ndarray, pandas.DataFrame, dict [str, numpy.ndarray], dict [str, pandas.DataFrame]}) –
One or multiple feature data sets, where each data set is size (num_features, num_observations).
Feature data set(s) may be provided in multiple ways:
numpy.ndarray : A single feature data set. Observation labels may be specified in `observation_labels`. To include feature labels, `data` should be a pandas.DataFrame. The array `data` is placed in a dict with a default label, of the form `{'data' : data}`.
pandas.DataFrame : A single feature data set. The index indicates feature labels. Columns indicate observation labels. If `observation_labels` is provided, it should match the columns of the dataframe, which will be ordered according to `observation_labels`. The array `data.values` is placed in a dict with a default label, of the form `{'data' : data.values}`.
dict [str, numpy.ndarray] : A single or multiple feature data set(s) may be provided as the value(s) of a dict keyed by a str that serves as the feature data descriptor or label for each input, of the form `{'data_label' : numpy.ndarray}`. All arrays are expected to have the same number of columns, corresponding to the number of observations in the same order. Observation labels may be specified in `observation_labels`. To include feature labels, pass a dict of pandas.DataFrames instead.
dict [str, pandas.DataFrame] : A single or multiple feature data set(s) may be provided as the value(s) of a dict keyed by a str that serves as the feature data descriptor or label for each input, of the form `{'data_label' : pandas.DataFrame}`. All dataframes are expected to have the same number of columns, corresponding to the number of observations, with the same columns provided in the same order. The index of each input indicates feature labels. Columns indicate observation labels. If `observation_labels` is provided, it should match the columns of the dataframe(s), which will be ordered according to `observation_labels`.
observation_labels (list [str], optional) –
List of labels for each observation, with length equal to num_observations. If provided when `data` is a pandas.DataFrame or dict [str, pandas.DataFrame], it should match the columns of the dataframe(s), which will be ordered according to `observation_labels`. If not provided and `data` is a:
pandas.DataFrame or dict [str, pandas.DataFrame], then the columns of the (first) dataframe are used.
numpy.ndarray or dict [str, numpy.ndarray], then the columns are labeled by their positional index, \(0, 1, ..., num\_observations - 1\).
Notes
All data sets are assumed to contain the same set of data points in the same order.
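For orientation, a minimal construction sketch based on the signatures documented above (the labels and values here are hypothetical):

```python
import numpy as np
import pandas as pd

from netflow.keepers.keeper import DataKeeper

# Features as rows, observations as columns: shape (num_features, num_observations).
df = pd.DataFrame(
    np.random.rand(3, 4),
    index=["feat_a", "feat_b", "feat_c"],          # feature labels
    columns=["obs_0", "obs_1", "obs_2", "obs_3"],  # observation labels
)

dk = DataKeeper(data=df)                       # stored under the default key 'data'
dk2 = DataKeeper(data={"expression": df})      # or under a custom label
dk2.add_data(df.values * 2.0, label="scaled")  # arrays reuse the keeper's labels
```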
- __module__ = 'netflow.keepers.keeper'#
- add_data(data, label)[source]#
Add a feature data set to the keeper.
Note
Observation labels may be set after object initialization. This may require upkeep in `Keeper` to update and coordinate observation labels among the different keepers.
- Parameters:
data ({numpy.ndarray, pandas.DataFrame}) – The data set of size (num_features, num_observations).
label (str) – Reference label describing the data set.
- property data#
A dictionary of all data sets.
- property features_labels#
A dictionary of feature labels for each data set.
- property num_features#
A dictionary with the number of features in each data set.
- property num_observations#
Number of observations.
- observation_index(observation_label)[source]#
Return index of observation.
- Parameters:
observation_label (str) – The observation label.
- Returns:
observation_index – Index of observation in list of observation labels.
- Return type:
int
- property observation_labels#
Labels for each observation.
- standardize(key, **kwargs)[source]#
Standardize features by removing the mean and scaling to unit variance.
- Parameters:
key (str) – The reference key of the data in the data-keeper that will be standardized.
kwargs – Keyword arguments passed to sklearn.preprocessing.StandardScaler.
- Returns:
data_z – The standardized data.
- Return type:
pandas.DataFrame
- subset(observations)[source]#
Return a new instance of DataKeeper restricted to a subset of observations.
- Parameters:
observations (list) –
List of observations to include in the subset. This is treated differently depending on the type of observation labels:
If `self._observation_labels` is List [str], `observations` can be of the form: List [str], to reference observations by their str label; or List [int], to reference observations by their location index.
If `self._observation_labels` is None or List [int], `observations` must be of the form List [int], to reference observations by their location index.
- Returns:
data_subset – A DataKeeper object restricted to the selected observations.
- Return type:
DataKeeper
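A short usage sketch (unlabeled observations, so location indices are used):

```python
import numpy as np

from netflow.keepers.keeper import DataKeeper

dk = DataKeeper(data=np.random.rand(3, 4))  # observations labeled by index 0..3

# Without string labels, observations must be referenced by location index.
dk_sub = dk.subset([0, 2, 3])
print(dk_sub.num_observations)  # expected: 3
```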
- class netflow.keepers.keeper.DataView(dkeeper=None, label=None)[source]#
A class to extract a single data view from a DataKeeper for analysis.
- Parameters:
dkeeper (DataKeeper) – The object to extract data from.
label (str) – The identifier of the data to be extracted from `dkeeper`.
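A minimal extraction sketch (the data label is hypothetical):

```python
import numpy as np
import pandas as pd

from netflow.keepers.keeper import DataKeeper, DataView

df = pd.DataFrame(
    np.random.rand(3, 4),
    index=["feat_a", "feat_b", "feat_c"],
    columns=[f"obs_{i}" for i in range(4)],
)
dk = DataKeeper(data={"expression": df})

# Pull a single data set out of the keeper for analysis.
view = DataView(dkeeper=dk, label="expression")
print(view.num_features, view.num_observations)  # expected: 3 4
```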
- __module__ = 'netflow.keepers.keeper'#
- property data#
The data set.
- feature_index(feature_label)[source]#
Return index of feature.
- Parameters:
feature_label (str) – The feature label.
- Returns:
feature_index – Index of feature in list of feature labels.
- Return type:
int
- property feature_labels#
Feature labels.
- property label#
The data label.
- property num_features#
The number of features in the data set.
- property num_observations#
Number of observations.
- observation_index(observation_label)[source]#
Return index of observation.
- Parameters:
observation_label (str) – The observation label.
- Returns:
observation_index – Index of observation in list of observation labels.
- Return type:
int
- property observation_labels#
Labels for each observation.
- standardize(**kwargs)[source]#
Standardize features by removing the mean and scaling to unit variance.
- Parameters:
kwargs – Keyword arguments passed to sklearn.preprocessing.StandardScaler.
- Returns:
data_z – The standardized data.
- Return type:
pandas.DataFrame
- subset(observations=None, features=None)[source]#
Return data for specified subset of observations and/or features.
- Parameters:
observations (list) –
List of observations to include in the subset. This is treated differently depending on the type of observation labels:
If `self._observation_labels` is List [str], `observations` can be of the form: List [str], to reference observations by their str label; or List [int], to reference observations by their location index.
If `self._observation_labels` is None or List [int], `observations` must be of the form List [int], to reference observations by their location index.
features (list) – List of features to include in the subset. The list type depends on the type of `self._feature_labels`, analogous to `self._observation_labels` for `observations` above.
- Returns:
data – The data subset where the \(ij^{th}\) entry is the value of the \(i^{th}\) feature in `features` for the \(j^{th}\) observation in `observations`.
- Return type:
pandas.DataFrame, (`len(observations)`, `len(features)`)
- class netflow.keepers.keeper.DistanceKeeper(data=None, observation_labels=None)[source]#
A class to store and handle multiple distances.
This class may also be used to store and handle similarity matrices. Distance may be interchanged with similarity, but distance is used for simplicity.
- Parameters:
data ({numpy.ndarray, pandas.DataFrame, dict [str, numpy.ndarray], dict [str, pandas.DataFrame]}) –
One or multiple symmetric distance(s) between observations, where each distance matrix is size (num_observations, num_observations).
Distance(s) may be provided in multiple ways:
numpy.ndarray : A single distance matrix. Observation labels may be specified in `observation_labels`. To include observation labels in the matrix itself, `data` should be a pandas.DataFrame. The array `data` is placed in a dict with a default label, of the form `{'distance' : data}`.
pandas.DataFrame : A single distance matrix. The index should be the same as the columns, which indicate observation labels. If `observation_labels` is provided, it should match the rows and columns of the dataframe, which will be ordered according to `observation_labels`. The array `data.values` is placed in a dict with a default label, of the form `{'distance' : data.values}`.
dict [str, numpy.ndarray] : A single or multiple distance(s) may be provided as the value(s) of a dict keyed by a str that serves as the distance descriptor or label for each input, of the form `{'distance_label' : numpy.ndarray}`. All arrays are expected to have the same number of rows and columns, corresponding to the number of observations in the same order. Observation labels may be specified in `observation_labels`.
dict [str, pandas.DataFrame] : A single or multiple distance(s) may be provided as the value(s) of a dict keyed by a str that serves as the distance descriptor or label for each input, of the form `{'distance_label' : pandas.DataFrame}`. All dataframes are expected to have the same number of rows and columns, corresponding to the number of observations, with the same columns provided in the same order. The index and columns indicate observation labels. If `observation_labels` is provided, it should match the columns of the dataframe(s), which will be ordered according to `observation_labels`.
observation_labels (list [str], optional) –
List of labels for each observation, with length equal to num_observations. If provided when `data` is a pandas.DataFrame or dict [str, pandas.DataFrame], it should match the columns of the dataframe(s), which will be ordered according to `observation_labels`. If not provided and `data` is a:
pandas.DataFrame or dict [str, pandas.DataFrame], then the columns of the (first) dataframe are used.
numpy.ndarray or dict [str, numpy.ndarray], then the columns are labeled by their positional index, \(0, 1, ..., num\_observations - 1\).
Notes
All sets are assumed to contain the same set of data points in the same order.
- __module__ = 'netflow.keepers.keeper'#
- add_data(data, label)[source]#
Add a symmetric distance matrix to the keeper.
- Parameters:
data ({numpy.ndarray, pandas.DataFrame}) – The distance matrix of size (num_observations, num_observations).
label (str) – Reference label describing the input.
- add_stacked_data(data, label, diag=0.0)[source]#
Add a symmetric distance from stacked Series to the keeper.
- Parameters:
data (pandas.Series) – The stacked distances of size (num_observations * (num_observations - 1) / 2,) with a 2-multi-index of the pairwise observation labels.
label (str) – Reference label describing the input.
diag (float) – Value used on diagonal.
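A sketch of the expected stacked format (observation labels hypothetical):

```python
import pandas as pd

from netflow.keepers.keeper import DistanceKeeper

# One entry per observation pair (no self-pairs): n * (n - 1) / 2 = 3 for n = 3,
# keyed by a 2-level MultiIndex of the pairwise observation labels.
stacked = pd.Series(
    [1.0, 2.0, 1.5],
    index=pd.MultiIndex.from_tuples(
        [("obs_0", "obs_1"), ("obs_0", "obs_2"), ("obs_1", "obs_2")]
    ),
)

dk = DistanceKeeper()
dk.add_stacked_data(stacked, label="euclidean", diag=0.0)  # unstacked to (3, 3)
```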
- property data#
A dictionary of all distances.
- property num_observations#
Number of observations.
- observation_index(observation_label)[source]#
Return index of observation.
- Parameters:
observation_label (str) – The observation label.
- Returns:
observation_index – Index of observation in list of observation labels.
- Return type:
int
- property observation_labels#
Labels for each observation.
- subset(observations)[source]#
Return a new instance of DistanceKeeper restricted to a subset of observations.
- Parameters:
observations (list) –
List of observations to include in the subset. This is treated differently depending on the type of observation labels:
If `self._observation_labels` is List [str], `observations` can be of the form: List [str], to reference observations by their str label; or List [int], to reference observations by their location index.
If `self._observation_labels` is None or List [int], `observations` must be of the form List [int], to reference observations by their location index.
- Returns:
distance_subset – A DistanceKeeper object restricted to the selected observations.
- Return type:
DistanceKeeper
- class netflow.keepers.keeper.DistanceView(dkeeper=None, label=None)[source]#
A class to extract a single distance matrix view from a DistanceKeeper for analysis.
This class may also be used to extract a similarity matrix. Distance may be interchanged with similarity, but distance is used for simplicity.
- Parameters:
dkeeper (DistanceKeeper) – The object to extract the distance from.
label (str) – The identifier of the distance matrix to be extracted from `dkeeper`.
- __module__ = 'netflow.keepers.keeper'#
- property data#
The distance matrix.
- density()[source]#
Return the density of each observation.
The density of an observation is its net distance to all other observations. This should be minimized for distances and maximized for similarities.
- Returns:
d – The densities indexed by the observations.
- Return type:
pandas.Series
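A short sketch (the distance values are hypothetical; 'distance' is the default label a DistanceKeeper assigns to a bare array):

```python
import numpy as np

from netflow.keepers.keeper import DistanceKeeper, DistanceView

d = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.5],
              [2.0, 1.5, 0.0]])
view = DistanceView(dkeeper=DistanceKeeper(data=d), label="distance")

dens = view.density()  # net distance of each observation to all others
print(dens.idxmin())   # the most central observation under this distance
```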
- property label#
The distance label.
- property num_observations#
Number of observations.
- observation_index(observation_label)[source]#
Return index of observation.
- Parameters:
observation_label (str) – The observation label.
- Returns:
observation_index – Index of observation in list of observation labels.
- Return type:
int
- property observation_labels#
Labels for each observation.
- subset(observations_a, observations_b=None)[source]#
Return the subset of distances between `observations_a` and `observations_b`.
- Parameters:
observations_a (list) –
Subset of observations to extract distances from, which make up the rows of the returned sub-distance matrix. This is treated differently depending on the type of observation labels:
If `self._observation_labels` is List [str], `observations_a` can be of the form: List [str], to reference observations by their str label; or List [int], to reference observations by their location index.
If `self._observation_labels` is None or List [int], `observations_a` must be of the form List [int], to reference observations by their location index.
observations_b ({None, list}, optional) – Subset of observations to extract distances to from `observations_a`, making up the columns of the returned sub-distance matrix. The list type depends on the type of `self._observation_labels`, analogous to `observations_a` above. If None, `observations_a` is used to create a symmetric matrix.
- Returns:
distance – The sub-matrix of distances where the \(ij^{th}\) entry is the distance between the \(i^{th}\) observation in `observations_a` and the \(j^{th}\) observation in `observations_b`.
- Return type:
pandas.DataFrame, (`len(observations_a)`, `len(observations_b)`)
- class netflow.keepers.keeper.GraphKeeper(graphs=None)[source]#
A class to store and handle multiple networks.
- Parameters:
graphs ({networkx.Graph, dict [str, networkx.Graph]}) –
One or multiple networks.
The network(s) may be provided in multiple ways:
networkx.Graph : A single network. The network is placed in a dict with a default label, of the form `{'graph' : graphs}`. To use a customized label for the network, provide the network as a dict, shown next.
dict [str, networkx.Graph] : A single or multiple networks may be provided as the value(s) of a dict keyed by a str that serves as the network descriptor or label for each input, of the form `{'graph_label' : networkx.Graph}`.
Notes
The graph label is also set to the graph’s name.
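A minimal sketch:

```python
import networkx as nx

from netflow.keepers.keeper import GraphKeeper

gk = GraphKeeper(graphs={"karate": nx.karate_club_graph()})
gk.add_graph(nx.path_graph(5), label="path")  # the label also becomes the graph's name
print(list(gk.graphs))  # expected: ['karate', 'path']
```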
- __module__ = 'netflow.keepers.keeper'#
- add_graph(graph, label)[source]#
Add a network to the keeper.
- Parameters:
graph (networkx.Graph) – The network.
label (str) – Reference label describing the network.
- property graphs#
A dictionary of all the graphs.
- class netflow.keepers.keeper.Keeper(data=None, distances=None, similarities=None, graphs=None, misc=None, observation_labels=None, outdir=None, verbose=None)[source]#
A class to store and handle data, distances and similarities between data points (or observations), and miscellaneous related results.
- Parameters:
data ({numpy.ndarray, pandas.DataFrame, dict [str, numpy.ndarray], dict [str, pandas.DataFrame]}) –
One or multiple feature data sets, where each data set is size (num_features, num_observations).
Feature data set(s) may be provided in multiple ways:
numpy.ndarray : A single feature data set. Observation labels may be specified in `observation_labels`. To include feature labels, `data` should be a pandas.DataFrame. The array `data` is placed in a dict with a default label, of the form `{'data' : data}`.
pandas.DataFrame : A single feature data set. The index indicates feature labels. Columns indicate observation labels. If `observation_labels` is provided, it should match the columns of the dataframe, which will be ordered according to `observation_labels`. The array `data.values` is placed in a dict with a default label, of the form `{'data' : data.values}`.
dict [str, numpy.ndarray] : A single or multiple feature data set(s) may be provided as the value(s) of a dict keyed by a str that serves as the feature data descriptor or label for each input, of the form `{'data_label' : numpy.ndarray}`. All arrays are expected to have the same number of columns, corresponding to the number of observations in the same order. Observation labels may be specified in `observation_labels`. To include feature labels, pass a dict of pandas.DataFrames instead.
dict [str, pandas.DataFrame] : A single or multiple feature data set(s) may be provided as the value(s) of a dict keyed by a str that serves as the feature data descriptor or label for each input, of the form `{'data_label' : pandas.DataFrame}`. All dataframes are expected to have the same number of columns, corresponding to the number of observations, with the same columns provided in the same order. The index of each input indicates feature labels. Columns indicate observation labels. If `observation_labels` is provided, it should match the columns of the dataframe(s), which will be ordered according to `observation_labels`.
distances ({numpy.ndarray, pandas.DataFrame, dict [str, numpy.ndarray], dict [str, pandas.DataFrame]}) –
One or multiple symmetric distance(s) between observations, where each distance matrix is size (num_observations, num_observations).
Distance(s) may be provided in multiple ways:
numpy.ndarray : A single distance matrix. Observation labels may be specified in `observation_labels`. To include observation labels in the matrix itself, `distances` should be a pandas.DataFrame. The array `distances` is placed in a dict with a default label, of the form `{'distance' : distances}`.
pandas.DataFrame : A single distance matrix. The index should be the same as the columns, which indicate observation labels. If `observation_labels` is provided, it should match the rows and columns of the dataframe, which will be ordered according to `observation_labels`. The array `distances.values` is placed in a dict with a default label, of the form `{'distance' : distances.values}`.
dict [str, numpy.ndarray] : A single or multiple distance(s) may be provided as the value(s) of a dict keyed by a str that serves as the distance descriptor or label for each input, of the form `{'distance_label' : numpy.ndarray}`. All arrays are expected to have the same number of rows and columns, corresponding to the number of observations in the same order. Observation labels may be specified in `observation_labels`.
dict [str, pandas.DataFrame] : A single or multiple distance(s) may be provided as the value(s) of a dict keyed by a str that serves as the distance descriptor or label for each input, of the form `{'distance_label' : pandas.DataFrame}`. All dataframes are expected to have the same number of rows and columns, corresponding to the number of observations, with the same columns provided in the same order. The index and columns indicate observation labels. If `observation_labels` is provided, it should match the columns of the dataframe(s), which will be ordered according to `observation_labels`.
similarities ({numpy.ndarray, pandas.DataFrame, dict [str, numpy.ndarray], dict [str, pandas.DataFrame]}) – One or multiple symmetric similarities between observations, where each similarity matrix is size (num_observations, num_observations). Similarities may be provided in multiple ways, analogous to `distances`.
graphs ({networkx.Graph, dict [str, networkx.Graph]}) –
One or multiple networks.
The network(s) may be provided in multiple ways:
networkx.Graph : A single network. The network is placed in a dict with a default label, of the form `{'graph' : graphs}`. To use a customized label for the network, provide the network as a dict, shown next.
dict [str, networkx.Graph] : A single or multiple networks may be provided as the value(s) of a dict keyed by a str that serves as the network descriptor or label for each input, of the form `{'graph_label' : networkx.Graph}`.
misc (dict) – Miscellaneous data or results.
observation_labels (list [str], optional) –
List of labels for each observation, with length equal to num_observations.
Labels will be set depending on the input format, accordingly:
If `observation_labels` is provided and `data`, `distances`, or `similarities` is a:
pandas.DataFrame or dict [str, pandas.DataFrame], then `observation_labels` should match the columns of the dataframe(s), which will be ordered according to `observation_labels`.
numpy.ndarray or dict [str, numpy.ndarray], then the array(s) is (are) assumed to be ordered according to `observation_labels`.
If `observation_labels` is not provided and `data`, `distances`, or `similarities` is a:
pandas.DataFrame or dict [str, pandas.DataFrame], then all dataframes are expected to have the same column names, which are used as the `observation_labels`.
numpy.ndarray or dict [str, numpy.ndarray], then the array(s) is (are) assumed to have columns corresponding to the observations in the same order. Default values of the form 'X0', 'X1', ... are used for `observation_labels`.
outdir ({None, str, pathlib.Path}) – Global path where any results will be saved. If None, no results will be saved.
Notes
All data sets, distances and similarities are assumed to contain the same set of data points in the same order.
Subsets of a data set, distance or similarity should be stored in `Keeper.misc` as a pandas.DataFrame to keep track of the subset of observations (and features).
- PCA(key, n_components=None, random_state=None)[source]#
Principal component analysis (PCA) decomposition.
- Parameters:
key (str) – The reference key of the data in the data-keeper that PCA decomposition will be performed on.
n_components ({None, int}) – The number of principal components to keep. If None, all principal components are kept.
random_state ({None, int}) – Random state used for certain solvers. Pass an int for reproducible results across runs.
- Return type:
PCA data with label “{key}_PCA” is added to the data keeper.
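A usage sketch with a hypothetical data label:

```python
import numpy as np
import pandas as pd

from netflow.keepers.keeper import Keeper

df = pd.DataFrame(np.random.rand(10, 6), columns=[f"obs_{i}" for i in range(6)])
keeper = Keeper(data={"expression": df})

# Per the docstring, the decomposition is stored under the key "expression_PCA".
keeper.PCA("expression", n_components=2, random_state=0)
```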
- __init__(data=None, distances=None, similarities=None, graphs=None, misc=None, observation_labels=None, outdir=None, verbose=None)[source]#
- __module__ = 'netflow.keepers.keeper'#
- _check_num_observations()[source]#
Check that num_observations is consistent (or None) across keepers.
- _check_observation_labels()[source]#
Check that observation_labels are consistent (or None) across keepers.
- add_data(data, label)[source]#
Add a feature data set to the data keeper.
- Parameters:
data ({numpy.ndarray, pandas.DataFrame}) – The data set of size (num_features, num_observations).
label (str) – Reference label describing the data set.
- add_distance(data, label)[source]#
Add a distance array to the distance keeper.
- Parameters:
data ({numpy.ndarray, pandas.DataFrame}) – The distance array of size (num_observations, num_observations).
label (str) – Reference label describing the distance.
- add_graph(graph, label)[source]#
Add a network to the graph keeper.
- Parameters:
graph (networkx.Graph) – The network to add.
label (str) – Reference label describing the network.
- add_misc(data, label)[source]#
Add misc information to be stored.
- Parameters:
data – The misc information, e.g., a graph.
label (str) – Reference label describing the input.
- add_similarity(data, label)[source]#
Add a similarity array to the similarity keeper.
- Parameters:
data ({numpy.ndarray, pandas.DataFrame}) – The similarity array of size (num_observations, num_observations).
label (str) – Reference label describing the similarity.
- add_stacked_distance(data, label)[source]#
Add a stacked distance array to the distance keeper.
- Parameters:
data (pandas.Series) – The stacked distances of size (num_observations * (num_observations - 1) / 2,) with a 2-multi-index of the pairwise observation labels.
label (str) – Reference label describing the distance.
- add_stacked_similarity(data, label, diag=1.0)[source]#
Add a stacked similarity array to the similarity keeper.
- Parameters:
data (pandas.Series) – The stacked similarities of size (num_observations * (num_observations - 1) / 2,) with a 2-multi-index of the pairwise observation labels.
label (str) – Reference label describing the similarity.
diag (float) – Value used on diagonal.
- compute_dpt_from_augmented_sym_transitions(key, n_comps: int = 0, save_eig=False)[source]#
Compute the diffusion pseudotime metric between observations from the symmetric transitions.
Note
\(T\) is the symmetric transition matrix.
\(M(x,z) = \sum_{i=1}^{n-1} \frac{\lambda_i}{1 - \lambda_i}\psi_i(x)\psi_i^T(z)\)
\(dpt(x,z) = ||M(x, \cdot) - M(z, \cdot)||^2\)
- Parameters:
key (str) – Reference ID for the symmetric transitions numpy.ndarray, (n_observations, n_observations), stored in `keeper.misc`.
n_comps (int) – Number of eigenvalues/vectors to be computed; set `n_comps = 0` to compute the whole spectrum. Alternatively, if `n_comps >= n_observations`, the whole spectrum will be computed.
- Returns:
dpt (numpy.ndarray, (n_observations, n_observations)) – Pairwise-observation diffusion pseudotime distances, stored in keeper.distances[dpt_key] where `dpt_key="dpt_from_{key}"`. If the full spectrum is not used (i.e., `0 < n_comps < n_observations`), then `dpt_key="dpt_from_{key}_{n_comps}comps"`.
- compute_dpt_from_similarity(similarity_key, density_normalize: bool = True, n_comps: int = 0, save_eig=False)[source]#
Compute the diffusion pseudotime metric between observations from a similarity.
Note
This entails computing the augmented symmetric transitions.
\(T\) is the symmetric transition matrix.
\(M(x,z) = \sum_{i=1}^{n-1} \frac{\lambda_i}{1 - \lambda_i}\psi_i(x)\psi_i^T(z)\)
\(dpt(x,z) = ||M(x, \cdot) - M(z, \cdot)||^2\)
- Parameters:
similarity_key (str) – Reference key to the numpy.ndarray, (n_observations, n_observations) symmetric similarity measure (with 1s on the diagonal) stored in the similarities in the keeper.
density_normalize (bool) – The density rescaling of Coifman and Lafon (2006): Then only the geometry of the data matters, not the sampled density.
n_comps (int) – Number of eigenvalues/vectors to be computed; set `n_comps = 0` to compute the whole spectrum. Alternatively, if `n_comps >= n_observations`, the whole spectrum will be computed.
- Returns:
transitions_asym (numpy.ndarray, (n_observations, n_observations)) – Asymmetric transition matrix (with 0s on the diagonal), added to `keeper.misc[f"transitions_asym_{similarity_key}"]`.
transitions_sym (numpy.ndarray, (n_observations, n_observations)) – Symmetric transition matrix (with 0s on the diagonal), added to `keeper.misc[f"transitions_sym_{similarity_key}"]`.
dpt (numpy.ndarray, (n_observations, n_observations)) – Pairwise-observation diffusion pseudotime distances, stored in keeper.distances[dpt_key] where `dpt_key="dpt_from_transitions_asym_{similarity_key}"`. If the full spectrum is not used (i.e., `0 < n_comps < n_observations`), then `dpt_key="dpt_from_transitions_asym_{similarity_key}_{n_comps}comps"`.
- Return type:
The following are stored in the keeper
- compute_multiscale_VNE_transition_from_similarity(similarity_key, tau_max=None, do_save=True)[source]#
Compute the multi-scale transition matrix based on the elbow of the Von Neumann Entropy (VNE), as described in GSPA and PHATE (KrishnaswamyLab/spARC, https://pdfs.semanticscholar.org/16ab/e92b7630d5b84b904bde97dad9b9fbce406c.pdf).
- Parameters:
similarity_key (str) – Reference key to the numpy.ndarray, (n_observations, n_observations) symmetric similarity measure (with 1s on the diagonal) stored in the similarities in the keeper.
tau_max (int) – Max scale `tau` tested for VNE (default is 100).
do_save (bool) – If True, save to `keeper`.
- Returns:
P (numpy.ndarray, (n_observations, n_observations)) – The symmetric VNE multi-scale transition matrix (with 0s on the diagonal). If `do_save` is True, `P` is added to `keeper.misc` with the key `'transitions_sym_multiscaleVNE_{similarity_key}'`.
P_asym (numpy.ndarray, (n_observations, n_observations)) – The random-walk VNE multi-scale transition matrix (with 0s on the diagonal). If `do_save` is True, `P_asym` is added to `keeper.misc` with the key `'transitions_multiscaleVNE_{similarity_key}'`.
- compute_rw_transitions_from_similarity(similarity_key)[source]#
Compute the row-stochastic transition matrix and store in keeper.
- Parameters:
similarity_key (str) – Reference key to the numpy.ndarray, (n_observations, n_observations) symmetric similarity measure (with 1s on the diagonal) stored in the similarities in the keeper.
- Returns:
transitions_rw_{similarity_key} (numpy.ndarray, (n_observations, n_observations)) – Asymmetric random walk transition matrix.
- Return type:
Adds the following to keeper.misc (with 0s on the diagonal)
- compute_sigmas(distance_key, label=None, n_neighbors=None, method='max', return_nn=False)[source]#
Set sigma for each observation as the distance to its k-th nearest neighbor, computed from a distance stored in the keeper.
- Parameters:
distance_key (str) – The label used to reference the distance matrix stored in `keeper.distances`, of size (n_observations, n_observations).
label ({None, str}) – If provided, this is appended to the context tag (`tag = f"{method}{n_neighbors}nn_{distance_key}"`). The key used to store the results defaults to the tag when `label` is not provided: `key = tag`. Otherwise, the key is set to `key = tag + "-" + label`. The resulting sigmas are stored in `keeper.misc['sigmas_' + key]`. If `return_nn` is True, nearest neighbor indices are stored in `keeper.misc['nn_indices_' + key]` and nearest neighbor distances are stored in `keeper.misc['nn_distances_' + key]`.
n_neighbors ({int, None}) – K-th nearest neighbor (or number of nearest neighbors) to use for computing `sigmas`, `n_neighbors > 0`. (Uses `n_neighbors + 1`, since each observation is its own closest neighbor.) If None, all neighbors are used.
method ({'mean', 'median', 'max'}) –
Indicates how to compute sigma. Options:
'mean' : mean of distances to the `n_neighbors` nearest neighbors
'median' : median of distances to the `n_neighbors` nearest neighbors
'max' : distance to the `n_neighbors`-th nearest neighbor
return_nn (bool) – If True, also store indices and distances of the `n_neighbors` nearest neighbors.
- Returns:
sigmas (numpy.ndarray, (n_observations,)) – The distance to the k-th nearest neighbor for each row of the distance matrix `d`. Sigmas represent the kernel width of each data point's accessible neighborhood. Written to `keeper.misc['sigmas_' + key]`.
indices (numpy.ndarray, (n_observations,)) – Indices of nearest neighbors, where each row corresponds to an observation. Written to `keeper.misc['nn_indices_' + key]` if `return_nn` is True.
distances (numpy.ndarray, (n_observations, `n_neighbors + 1`)) – Distances to nearest neighbors, where each row corresponds to an observation. Written to `keeper.misc['nn_distances_' + key]` if `return_nn` is True.
- compute_similarity_from_distance(distance_key, n_neighbors, method, precomputed_method=None, label=None, knn=False)[source]#
Convert distance matrix to symmetric similarity measure.
The resulting similarity is written to the similarity keeper.
- Parameters:
distance_key (str) – The label used to reference the distance matrix stored in `keeper.distances`, of size (n_observations, n_observations).
n_neighbors ({int, None}) – K-th nearest neighbor (or number of nearest neighbors) to use for computing `sigmas`, `n_neighbors > 0`. (Uses `n_neighbors + 1`, since each observation is its own closest neighbor.) If None, all neighbors are used.
method ({float, 'mean', 'median', 'max', 'precomputed'}) –
Indicates how to compute sigma. Options:
float : constant float to use as sigma
int : constant int to use as sigma
'mean' : mean of distances to the `n_neighbors` nearest neighbors
'median' : median of distances to the `n_neighbors` nearest neighbors
'max' : distance to the `n_neighbors`-th nearest neighbor
'precomputed' : precomputed values extracted from `keeper.misc[f"sigmas_{key}"]` as a numpy.ndarray of size (n_observations,)
precomputed_method ({'mean', 'median', 'max'}) – This is ignored if method is not 'precomputed'. When method is 'precomputed', specify the method that was previously used for computing sigmas. See method for a description of the options.
label ({None, str}) – If provided, this is appended to the context tag (`tag = f"{method}{n_neighbors}nn_{distance_key}"`). The key used to store the resulting similarity matrix of size (n_observations, n_observations) in `keeper.similarities[f"similarity_{key}"]` defaults to the tag when `label` is not provided: `key = tag`. Otherwise, the key is set to `key = tag + "-" + label`.
knn (bool) – If True, restrict the similarity measure to be non-zero only between `n_neighbors` nearest neighbors.
- Returns:
K – Symmetric similarity measure. Written to `keeper.similarities[key]`.
- Return type:
numpy.ndarray, (n_observations, n_observations)
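A pipeline sketch for the distance-to-similarity conversion (distance values and keys are hypothetical; the storage key follows the tag rules described above):

```python
import numpy as np

from netflow.keepers.keeper import Keeper

rng = np.random.default_rng(0)
pts = rng.random((20, 2))
d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)  # (20, 20) distances

keeper = Keeper(distances={"euclidean": d})

# Per-observation kernel width sigma taken as the distance to the 5th nearest neighbor.
K = keeper.compute_similarity_from_distance("euclidean", n_neighbors=5, method="max")
print(K.shape)  # (20, 20); the matrix is also written to the similarity keeper
```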
- compute_transitions_from_similarity(similarity_key, density_normalize: bool = True)[source]#
Compute symmetric and asymmetric transition matrices and store in keeper.
Note
Code primarily copied from scanpy.neighbors.
- Parameters:
similarity_key (str) – Reference key to the numpy.ndarray, (n_observations, n_observations) symmetric similarity measure (with 1s on the diagonal) stored in the similarities in the keeper.
density_normalize (bool) – The density rescaling of Coifman and Lafon (2006): Then only the geometry of the data matters, not the sampled density.
- Returns:
transitions_asym_{similarity_key} (numpy.ndarray, (n_observations, n_observations)) – Asymmetric transition matrix.
transitions_sym_{similarity_key} (numpy.ndarray, (n_observations, n_observations)) – Symmetric transition matrix.
- Return type:
Adds the following to keeper.misc (with 0s on the diagonals)
- construct_pose(key, root=None, root_as_tip=False, min_branch_size=5, choose_largest_segment=False, flavor='haghverdi16', allow_kendall_tau_shift=False, smooth_corr=True, brute=True, split=True, verbose=None, n_branches=2, until_branched=False, mutual=False, k_mnn=3, connect_closest=False, connect_trunk='classic', annotate=True)[source]#
Construct the POSE from specified distance.
- Parameters:
key (str) – The label used to reference the distance matrix stored in `keeper.distances`, of size (n_observations, n_observations).
root ({None, int, 'density', 'density_inv', 'ratio'}) –
The root. If None, 'density' is used. Options:
int : index of observation
'density' : select the observation with minimal distance-density
'density_inv' : select the observation with maximal distance-density
'ratio' : select the observation which leads to maximal triangular ratio distance
root_as_tip (bool) – If True, force first tip as the root. Defaults to False following scanpy implementation.
min_branch_size ({int, float}) – During recursive splitting of branches, only consider splitting a branch with at least `min_branch_size > 2` data points. If a float, `min_branch_size` refers to the fraction of the total number of data points (`0 < min_branch_size < 1`).
choose_largest_segment (bool) – If True, select the largest segment for branching.
flavor ({'haghverdi16', 'wolf17_tri', 'wolf17_bi', 'wolf17_bi_un'}) – Branching algorithm (based on scanpy implementation).
allow_kendall_tau_shift (bool) – If a very small branch is detected upon splitting, shift away from maximum correlation in Kendall tau criterion of [Haghverdi16] to stabilize the splitting.
smooth_corr (bool, default = False) – If True, smooth correlations before identifying cut points for branch splitting.
brute (bool) – If True, data points not associated with any branch upon split are combined with undecided (trunk) points. Otherwise, if False, they are treated as individual islands, not associated with any branch (and assigned branch index -1).
split (bool, default = True) – If True, split the segment into multiple branches. Otherwise, determine a single branching off of the main segment. This is ignored if flavor is not 'haghverdi16'. If True, `brute` is ignored.
n_branches (int) – Number of branches to look for (`n_branches > 0`).
until_branched (bool) – If True, iteratively find a segment to branch and perform branching until a segment is successfully branched or no branchable segments remain. Otherwise, if False, attempt to perform branching only once on the next potentially branchable segment. Note: this is only applicable when branching is being performed. If previous iterations of branching have already been performed, it is not possible to identify the number of iterations where no branching was performed.
mutual (bool, default = False) – If True, add `k_mnn` mutual nearest neighbor edges. Otherwise, add a single nearest neighbor edge. When False, `k_mnn` is ignored.
k_mnn (int, `0 < k_mnn < len(G)`) – The number of nearest neighbors to consider when extracting mutual nearest neighbors. Note, this is ignored when `mutual` is False.
connect_closest (bool, default = False) – If True, connect branches by the points with the smallest distance between the branches. Otherwise, connect by continuum of ordering.
connect_trunk ({'classic', 'endpoint', 'dual'}, default = 'classic') –
Specify how to connect segments to an unresolved/unidentified trunk. Note, this only applies when a split results in a trunk consisting of unresolved/unidentified points. Additionally, this is ignored if `flavor != 'haghverdi16'`. It is also ignored if `flavor = 'haghverdi16'` and `split = False`.
Options:
classic : the point identified in the trunk is connected to the point in the segment closest to it
endpoint : the point identified in the trunk is connected to the segment's second tip
dual : the point identified in the trunk is connected to both points determined by classic and endpoint
annotate (bool) – If True, annotate edges and nodes with POSE features.
- Returns:
poser (netflow.pose.POSER) – The object used to construct the POSE.
G_poser_nn (networkx.Graph) – The updated graph with nearest neighbor edges. If `annotate` is True, the edge attribute "edge_origin" is added with the possible values:
"POSE" : for edges in the original graph that are not nearest neighbor edges
"NN" : for nearest neighbor edges that were not in the original graph
"POSE + NN" : for edges in the original graph that are also nearest neighbor edges
- convert_similarity_to_distance(similarity_key)[source]#
Convert a similarity to a distance.
The distance, computed as 1 - similarity, is added to the distance keeper with the key `"distance_from_" + similarity_key`.
- Parameters:
similarity_key (str) – The similarity reference key.
- Returns:
d – The new distance is saved to the keeper in `keeper.distances[f"distance_from_{similarity_key}"]`.
- Return type:
The following are saved to the distance keeper
- property data#
The feature data sets.
- distance_density(label)[source]#
Compute each observation’s density from a distance.
The density of an observation is its net distance to all other observations.
- Parameters:
label (str) – The reference label for the distance.
- Returns:
density – The densities indexed by the observation labels.
- Return type:
pandas.Series
- distance_density_argmax(label)[source]#
Find the observation with the largest density from a distance.
The density of an observation is its net distance to all other observations.
- Parameters:
label (str) – The reference label for the distance.
- Returns:
obs – The index of the observation with the largest density.
- Return type:
int
- distance_density_argmin(label)[source]#
Find the observation with the smallest density from a distance.
The density of an observation is its net distance to all other observations.
- Parameters:
label (str) – The reference label for the distance.
- Returns:
obs – The index of the observation with the smallest density.
- Return type:
int
- property distances#
The distances.
- euc_distance_pairwise_observation_feature_nbhd(data_key, graph_key, features=None, include_self=False, label=None, metric='euclidean', normalize=False, **kwargs)[source]#
Compute Euclidean distances between the feature neighborhoods of every two observations.
Note
If `object.outdir` is not None, Euclidean distances are saved to file every 10 iterations. Before starting the computation, check if the file exists. If so, load and remove already computed nodes from the iteration. Euclidean distances are computed for the remaining nodes and combined with the previously computed and saved results before saving and returning the combined results.
Only nodes with at least 2 neighbors are included, as leaf nodes will all have the same Euclidean distance and do not provide any further information.
The resulting observation-pairwise Euclidean distances are saved to misc (aka self.misc) and can be accessed by `self.misc[f"{data_key}_{label}_nbhd_euc_with{'' if include_self else 'out'}_self"]`.
- Parameters:
data_key (str) – The key to the data in the data keeper that should be used.
graph_key (str) – The key to the graph in the graph keeper that should be used. (Does not have to include all features in the data.)
features ({None, list [str]}) – List of features (nodes) to compute neighborhood distances on. If None, all features are used.
include_self (bool) – If True, add the node to its neighborhood, which results in computing the normalized profile over the neighborhood. If False, the node is not included in its neighborhood, which results in computing the transition distribution over the neighborhood.
label (str) – Label under which the resulting Euclidean distances are saved in `keeper.misc`, and the name of the file used to store stacked results.
metric (str) – The metric used to compute the distance, passed to scipy.spatial.distance.cdist.
normalize (bool) – If True, normalize neighborhood profiles to sum to 1.
**kwargs (dict) – Extra arguments to metric, passed to scipy.spatial.distance.cdist.
- Returns:
eds – Euclidean distances between pairwise observations, where rows are observation-pairs and columns are feature (node) names. Saved in `keeper.misc` with the key `f"{data_key}_{label}_nbhd_euc_with{'' if include_self else 'out'}_self"`.
- Return type:
pandas.DataFrame
- euc_distance_pairwise_observation_profile(data_key, features=None, label=None, metric='euclidean', normalize=False, **kwargs)[source]#
Compute Euclidean distances between the feature profiles of every two observations.
Note
If `object.outdir` is not None, Euclidean distances are saved to file. Before starting the computation, check if the file exists. If so, load and remove already computed nodes from the iteration. Euclidean distances are computed for the remaining nodes and combined with the previously computed and saved results before saving and returning the combined results.
Only nodes with at least 2 neighbors are included, as leaf nodes will all have the same Euclidean distance and do not provide any further information.
The resulting observation-pairwise Euclidean distances are saved to the DistanceKeeper (aka self.distances) and can be accessed by `self.distances[f'{data_key}_{label}_profile_euc']`.
- Parameters:
data_key (str) – The key to the data in the data keeper that should be used.
features ({None, list [str])}) – List of features to compute profile distances on. If None, all features are used.
label (str) – Label under which the resulting Euclidean distances are saved in `keeper.distances`, and the name of the file used to store stacked results.
metric (str) – The metric used to compute the distance, passed to scipy.spatial.distance.cdist.
normalize (bool) – If True, normalize neighborhood profiles to sum to 1.
**kwargs (dict) – Extra arguments to metric, passed to scipy.spatial.distance.cdist.
- Returns:
eds – Euclidean distances between pairwise profiles, where rows are observation-pairs and columns are node names. Saved in `keeper.distances` with the key `f'{data_key}_{label}_profile_euc'`.
- Return type:
pandas.DataFrame
- fuse_similarities(similarity_keys, weights=None, fused_key=None)[source]#
Fuse similarities in the keeper.
- Parameters:
similarity_keys (list) – Reference keys of similarities to fuse.
weights (list, optional) – Weight each similarity contributes to the fused similarity. Should be the same length as `similarity_keys`. If not provided, uniform weights are applied.
fused_key (str, optional) – Specify the key used to store the fused similarity in the keeper. Default behavior is to fuse the keys of the original similarities.
- Returns:
fused_sim : The fused similarity, where the reference key, if not provided, is fused from the original labels.
- Return type:
The following is added to the similarity keeper
- property graphs#
The networks.
- integrate_multiscale_VNE_transitions_from_similarities(similarity_keys, tau_max=None, integrated_key=None)[source]#
Integrate multi-scale transitions, where the scale is determined by the elbow of the Von Neumann Entropy (VNE), as described in https://pdfs.semanticscholar.org/16ab/e92b7630d5b84b904bde97dad9b9fbce406c.pdf.
- Parameters:
similarity_keys (list) – Reference keys of similarities to compute transitions from and integrate.
tau_max (int) – Max scale `tau` to test the Von Neumann Entropy on (default is 100).
integrated_key (str, optional) – Specify the key used to store the integrated transitions in the keeper. Default behavior is to integrate the keys of the original transitions.
- Returns:
integrated transition : The (symmetric) integrated multi-scale transition, where the reference key, if not provided, is fused from the original labels.
integrated asymmetric transition : The (asymmetric random-walk) integrated multi-scale transition, where the reference key, if not provided, is fused from the original labels.
- Return type:
The following is added to the similarity keeper
- integrate_transitions(transition_keys, integrated_key=None)[source]#
Integrate transitions.
- Parameters:
transition_keys (list) – Reference keys of transitions to integrate.
integrated_key (str, optional) – Specify the key used to store the integrated transitions in the keeper. Default behavior is to integrate the keys of the original transitions.
- Returns:
integrated transition : The integrated transition, where the reference key, if not provided, is fused from the original labels.
- Return type:
The following is added to the similarity keeper
- integrate_transitions_from_similarity(similarity_keys, integrated_key=None, density_normalize: bool = True, n_comps: int = 0)[source]#
Compute the integrated transition from similarities.
Note
This entails computing the augmented symmetric transitions.
\(T\) is the symmetric transition matrix.
- Parameters:
similarity_keys (list) – Reference keys of similarities to compute transitions from and integrate.
integrated_key (str, optional) – Specify the key used to store the integrated transitions in the keeper. Default behavior is to integrate the keys of the original transitions.
density_normalize (bool) – The density rescaling of Coifman and Lafon (2006): then only the geometry of the data matters, not the sampled density.
n_comps (int) – Number of eigenvalues/vectors to be computed; set `n_comps = 0` to compute the whole spectrum. Alternatively, if `n_comps >= n_observations`, the whole spectrum will be computed.
- Returns:
transitions_asym (numpy.ndarray, (n_observations, n_observations)) – Asymmetric transition matrix (with 0s on the diagonal), added to `keeper.misc[f"transitions_asym_{similarity_key}"]`.
transitions_sym (numpy.ndarray, (n_observations, n_observations)) – Symmetric transition matrix (with 0s on the diagonal), added to `keeper.misc[f"transitions_sym_{similarity_key}"]`.
transitions_i – The integrated transition, where the reference key, if not provided, is fused from the original labels.
- Return type:
The following are stored in the keeper
- load_data(file_name, label='data', file_path=None, file_format=None, delimiter=',', dtype=None, cols_as_obs=True, **kwargs)[source]#
Load data from file into the keeper.
Note
Currently loads data using `pandas.read_csv`. Additional formats will be added in the future.
- Parameters:
file_name ({str, pathlib.Path}) – Input data file name.
label (str, (default: ‘data’)) – Reference label describing the data set.
file_path ({str, pathlib.Path}, optional (default: None)) – File path. Empty string by default.
file_format (str, optional (default: None)) – File format. Currently supported file formats: 'txt', 'csv', 'tsv'. If None, `file_format` will be inferred from the file extension in `file_name`. Currently, this is ignored.
delimiter (str, optional (default: ',')) – Delimiter to use.
dtype – If provided, convert the data to this type after loading.
cols_as_obs (bool (default = True)) – If True, columns in the loaded data are observations; otherwise, the rows are observations.
**kwargs – Additional keyword arguments passed to `pandas.read_csv`.
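A hypothetical loading sketch (assumes a file expression.csv exists with observations as columns):

```python
from netflow.keepers.keeper import Keeper

keeper = Keeper()
# Load a CSV of shape (num_features, num_observations).
keeper.load_data("expression.csv", label="expression",
                 delimiter=",", cols_as_obs=True)
```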
- load_distance(file_name, label='distance', file_path=None, file_format=None, delimiter=',', **kwargs)[source]#
Load distance from file into the keeper.
Note
Assumed that the distance array is stored with the first row and first column as the header and index, respectively.
Currently loads data using `pandas.read_csv`. Additional formats will be added in the future.
- Parameters:
file_name ({str, pathlib.Path}) – Input distance file name.
label (str, (default: ‘distance’)) – Reference label describing the data set.
file_path ({str, pathlib.Path}, optional (default: None)) – File path. Empty string by default.
file_format (str, optional (default: None)) – File format. Currently supported file formats: 'txt', 'csv', 'tsv'. If None, `file_format` will be inferred from the file extension in `file_name`. Currently, this is ignored.
delimiter (str, optional (default: ',')) – Delimiter to use.
**kwargs – Additional keyword arguments passed to `pandas.read_csv`.
- load_graph(file_name, label='graph', file_path=None, file_format=None, delimiter=',', source='source', target='target', **kwargs)[source]#
Load network (edgelist) from file into graph and store in the keeper.
Note
Currently loads graph from edgelist. Future release will allow different graph types (e.g., adjacency, graphml).
Assumed that the edge-list is stored as two columns, where the first row is labeled as source and target.
Currently loads data using `pandas.read_csv`. Additional formats will be added in the future.
- Parameters:
file_name ({str, pathlib.Path}) – Input edge-list file name.
label (str, (default: ‘graph’)) – Reference label describing the network.
file_path ({str, pathlib.Path}, optional (default: None)) – File path. Empty string by default.
file_format (str, optional (default: None)) – File format. Currently supported file formats: 'txt', 'csv', 'tsv'. If None, `file_format` will be inferred from the file extension in `file_name`. Currently, this is ignored.
delimiter (str, optional (default: ',')) – Delimiter to use.
source ({str, int} (default: 'source')) – A valid column name (string or integer) for the source nodes, passed to `networkx.from_pandas_edgelist`.
target ({str, int} (default: 'target')) – A valid column name (string or integer) for the target nodes, passed to `networkx.from_pandas_edgelist`.
**kwargs – Additional keyword arguments passed to `pandas.read_csv`.
- load_similarity(file_name, label='similarity', file_path=None, file_format=None, delimiter=',', **kwargs)[source]#
Load similarity from file into the keeper.
Note
Assumed that the similarity array is stored with the first row and first column as the header and index, respectively.
Currently loads data using `pandas.read_csv`. Additional formats will be added in the future.
- Parameters:
file_name ({str, pathlib.Path}) – Input similarity file name.
label (str, (default: ‘similarity’)) – Reference label describing the data set.
file_path ({str, pathlib.Path}, optional (default: None)) – File path. Empty string by default.
file_format (str, optional (default: None)) – File format. Currently supported file formats: 'txt', 'csv', 'tsv'. If None, `file_format` will be inferred from the file extension in `file_name`. Currently, this is ignored.
delimiter (str, optional (default: ',')) – Delimiter to use.
**kwargs – Additional keyword arguments passed to `pandas.read_csv`.
- load_stacked_distance(file_name, label='distance', file_path=None, file_format=None, delimiter=',', **kwargs)[source]#
Load distance in stacked form from file, convert to unstacked form and store in the keeper.
Note
Assumed that the stacked distances are stored with a 2-multi-index of the pairwise observations (excluding self-pairs) and a single column with the pairwise distances.
Currently loads data using `pandas.read_csv`. Additional formats will be added in the future.
- Parameters:
file_name ({str, pathlib.Path}) – Input distance file name.
label (str, (default: ‘distance’)) – Reference label describing the data set.
file_path ({str, pathlib.Path}, optional (default: None)) – File path. Empty string by default.
file_format (str, optional (default: None)) – File format. Currently supported file formats: 'txt', 'csv', 'tsv'. If None, `file_format` will be inferred from the file extension in `file_name`. Currently, this is ignored.
delimiter (str, optional (default: ',')) – Delimiter to use.
**kwargs – Additional keyword arguments passed to `pandas.read_csv`.
- load_stacked_similarity(file_name, label='similarity', diag=1.0, file_path=None, file_format=None, delimiter=',', **kwargs)[source]#
Load similarity in stacked form from file, convert to unstacked form and store in the keeper.
Note
Assumed that the stacked similarities are stored with a 2-multi-index of the pairwise observations (excluding self-pairs) and a single column with the pairwise similarities.
Currently loads data using `pandas.read_csv`. Additional formats will be added in the future.
- Parameters:
file_name ({str, pathlib.Path}) – Input similarity file name.
label (str, (default: 'similarity')) – Reference label describing the data set.
diag (float) – Value used on diagonal.
file_path ({str, pathlib.Path}, optional (default: None)) – File path. Empty string by default.
file_format (str, optional (default: None)) – File format. Currently supported file formats: 'txt', 'csv', 'tsv'. If None, `file_format` will be inferred from the file extension in `file_name`. Currently, this is ignored.
delimiter (str, optional (default: ',')) – Delimiter to use.
**kwargs – Additional keyword arguments passed to `pandas.read_csv`.
- log1p(key, base=None)[source]#
Logarithmic data transformation.
Computes \(data = \log(data + 1)\) with the natural logarithm as the default base.
- Parameters:
key (str) – The reference key of the data in the data-keeper that will be logarithmically transformed.
base ({None, int}) – Base used for the logarithmic transformation.
- Return type:
Logarithmically transformed data with label “{key}_log1p” is added to the data keeper.
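A brief sketch (the data label is hypothetical):

```python
import numpy as np
import pandas as pd

from netflow.keepers.keeper import Keeper

counts = pd.DataFrame(np.random.poisson(5, size=(4, 3)))
keeper = Keeper(data={"counts": counts})

keeper.log1p("counts")  # computes log(data + 1); stored as "counts_log1p"
```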
- property misc#
The misc data.
- property num_observations#
The number of observations.
- observation_index(observation_label)[source]#
Return index of observation.
- Parameters:
observation_label (str) – The observation label.
- Returns:
observation_index – Index of observation in list of observation labels.
- Return type:
int
- property observation_labels#
Labels for each observation.
- save_data(label, file_format='txt', delimiter=',', **kwargs)[source]#
Save data to file.
Note
This currently only saves a pandas DataFrame to .txt, .csv, or .tsv. Future releases will allow for other formats.
Data set is saved to the file named ‘{self.outdir}/data_{label}.{file_format}’.
- Parameters:
label (str) – Reference label describing which data set to save.
file_format (str) – File format. Currently supported file formats: ‘txt’, ‘csv’, ‘tsv’.
delimiter (str, optional (default: ‘,’)) – Delimiter to use.
**kwargs – Additional keyword arguments passed to `pandas.to_csv`.
- save_distance(label, file_format='txt', delimiter=',', **kwargs)[source]#
Save distance to file.
Note
This currently only saves a pandas DataFrame to .txt, .csv, or .tsv. Future releases will allow for other formats.
Distance is saved to the file named ‘{self.outdir}/distance_{label}.{file_format}’.
- Parameters:
label (str) – Reference label describing which distance to save.
file_format (str) – File format. Currently supported file formats: ‘txt’, ‘csv’, ‘tsv’.
delimiter (str, optional (default: ‘,’)) – Delimiter to use.
**kwargs – Additional keyword arguments passed to pandas.DataFrame.to_csv.
- save_misc(label, file_format='txt', delimiter=',', **kwargs)[source]#
Save misc data to file.
Note
This currently only saves a pandas DataFrame to .txt, .csv, or .tsv. Future releases will allow for other formats.
Misc data is saved to the file named ‘{self.outdir}/misc_{label}.{file_format}’.
- Parameters:
label (str) – Reference label describing which misc data to save.
file_format (str) – File format. Currently supported file formats: ‘txt’, ‘csv’, ‘tsv’.
delimiter (str, optional (default: ‘,’)) – Delimiter to use.
**kwargs – Additional keyword arguments passed to pandas.DataFrame.to_csv.
- save_similarity(label, file_format='txt', delimiter=',', **kwargs)[source]#
Save similarity to file.
Note
This currently only saves a pandas DataFrame to .txt, .csv, or .tsv. Future releases will allow for other formats.
Similarity is saved to the file named ‘{self.outdir}/similarity_{label}.{file_format}’.
- Parameters:
label (str) – Reference label describing which similarity to save.
file_format (str) – File format. Currently supported file formats: ‘txt’, ‘csv’, ‘tsv’.
delimiter (str, optional (default: ‘,’)) – Delimiter to use.
**kwargs – Additional keyword arguments passed to pandas.DataFrame.to_csv.
- property similarities#
The similarities.
- similarity_density(label)[source]#
Compute each observation’s density from a similarity.
The density of an observation is its net (summed) similarity to all other observations.
- Parameters:
label (str) – The reference label for the similarity.
- Returns:
density – The densities indexed by the observation labels.
- Return type:
pandas.Series
- similarity_density_argmax(label)[source]#
Find the observation with the largest density from a similarity.
The density of an observation is its net (summed) similarity to all other observations.
- Parameters:
label (str) – The reference label for the similarity.
- Returns:
obs – The label of the observation with the largest density.
- Return type:
str
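Example: a minimal sketch of the two computations, assuming self-similarity on the diagonal is excluded from the net similarity; the similarity values and labels are hypothetical.

>>> import pandas as pd
>>> sim = pd.DataFrame(
...     [[1.0, 0.9, 0.2], [0.9, 1.0, 0.4], [0.2, 0.4, 1.0]],
...     index=["a", "b", "c"], columns=["a", "b", "c"],
... )
>>> density = sim.sum(axis=1) - 1.0  # net similarity, excluding self-pairs
>>> density.idxmax()  # observation with the largest density
'b'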
- standardize(key, label=None, **kwargs)[source]#
Standardize features in DataKeeper by removing the mean and scaling to unit variance.
- Parameters:
key (str) – The reference key of the data in the data-keeper that will be standardized.
label ({str, None}) – The label used to store the standardized data. If None, the default label f'{key}_z' is used.
**kwargs – Keyword arguments passed to sklearn.preprocessing.StandardScaler.
- Returns:
data_z – The standardized data.
- Return type:
pandas.DataFrame
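Example: a minimal sketch of the equivalent scikit-learn transformation. Since data sets are shaped (num_features, num_observations) and StandardScaler standardizes columns, the data is transposed; the array values are hypothetical.

>>> import numpy as np
>>> from sklearn.preprocessing import StandardScaler
>>> X = np.array([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])  # (num_features, num_observations)
>>> Xz = StandardScaler().fit_transform(X.T).T  # each feature: zero mean, unit variance
>>> Xz.round(2)
array([[-1.22,  0.  ,  1.22],
       [-1.22,  0.  ,  1.22]])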
- subset(observations, keep_misc=False, keep_graphs=False, outdir=None)[source]#
Return a new instance of Keeper restricted to a subset of observations.
The default behavior is to not include misc or graphs in the Keeper subset. This is because there is no check for which observations the misc and graphs correspond to.
Warning: The subset keeper and all data it contains are not copies.
- Parameters:
observations (list) –
List of observations to include in the subset. This is treated differently depending on the type of the observation labels:
If self._observation_labels is List [str], observations can be of the form:
List [str], to reference observations by their str label, or
List [int], to reference observations by their location index.
If self._observation_labels is None or List [int], observations must be of the form List [int], to reference observations by their location index.
keep_misc (bool) – If True, misc is added to the new Keeper.
keep_graphs (bool) – If True, the graphs are added to the new Keeper.
outdir ({None, str, pathlib.Path}) – Global path where any results will be saved. If None, no results will be saved.
- Returns:
keeper_subset – A Keeper object restricted to the selected observations.
- Return type:
Keeper
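Example: a hedged usage sketch; keeper and the observation labels are hypothetical.

>>> sub = keeper.subset(["obs1", "obs3"])         # by str label
>>> sub = keeper.subset([0, 2], keep_misc=False)  # by location index
>>> sub.num_observations
2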
- wass_distance_pairwise_observation_feature_nbhd(data_key, graph_key, features=None, include_self=False, label=None, graph_distances=None, edge_weight=None, proc=4, chunksize=None, measure_cutoff=1e-06, solvr=None)[source]#
Compute Wasserstein distances between the feature neighborhoods of every pair of observations.
Note
If object.outdir is not None, Wasserstein distances are saved to file every 10 iterations. Before starting the computation, the method checks whether the file exists; if so, it is loaded and the already-computed nodes are removed from the iteration. Wasserstein distances are computed for the remaining nodes and combined with the previously saved results before saving and returning the combined results.
Only nodes with at least 2 neighbors are included, as leaf nodes all have the same Wasserstein distance and do not provide any further information.
The resulting observation-pairwise Wasserstein distances are saved to misc (aka self.misc) and can be accessed by self.misc[f"{data_key}_{label}_nbhd_wass_with{'' if include_self else 'out'}_self"].
- Parameters:
data_key (str) – The key to the data in the data keeper that should be used.
graph_key (str) – The key to the graph in the graph keeper that should be used. (The graph does not have to include all features in the data.)
features ({None, list [str]}) – List of features (nodes) to compute neighborhood distances on. If None, all features are used.
include_self (bool) – If True, the node is included in its neighborhood, which results in computing the normalized profile over the neighborhood. If False, the node is not included in its neighborhood, which results in computing the transition distribution over the neighborhood.
label (str) – Label under which the resulting Wasserstein distances are saved in keeper.misc; also used as the name of the file storing stacked results.
graph_distances (numpy.ndarray, (n, n)) – A matrix of node-pairwise graph distances between the \(n\) nodes (ordered from \(0, 1, ..., n-1\)). If None, hop distance is used.
edge_weight ({None, str}) – The edge attribute used as the weight for computing the graph distances. This is ignored if graph_distances is provided. If None, no edge weight is used.
measure_cutoff (float) – Threshold for treating values in profiles as zero, default = 1e-6.
proc (int) – Number of processors used for multiprocessing. (Default value = cpu_count().)
chunksize (int) – Chunksize to allocate for multiprocessing.
solvr (str) – Solver to pass to POT library for computing Wasserstein distance.
- Returns:
wds – Wasserstein distances between pairwise observations, where rows are observation-pairs and columns are feature (node) names; saved in keeper.misc with the key f"{data_key}_{label}_nbhd_wass_with{'' if include_self else 'out'}_self".
- Return type:
pandas.DataFrame
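Example: a hedged usage sketch; the data key, graph key, and label are hypothetical.

>>> wds = keeper.wass_distance_pairwise_observation_feature_nbhd(
...     data_key="expression", graph_key="knn", include_self=False, label="w1"
... )
>>> keeper.misc["expression_w1_nbhd_wass_without_self"]  # same result, per the key format above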
- wass_distance_pairwise_observation_profile(data_key, graph_key, features=None, label=None, graph_distances=None, edge_weight=None, proc=4, chunksize=None, measure_cutoff=1e-06, solvr=None)[source]#
Compute Wasserstein distances between the feature profiles of every pair of observations.
Note
If object.outdir is not None, Wasserstein distances are saved to file every 10 iterations. Before starting the computation, the method checks whether the file exists; if so, it is loaded and the already-computed nodes are removed from the iteration. Wasserstein distances are computed for the remaining nodes and combined with the previously saved results before saving and returning the combined results.
Only nodes with at least 2 neighbors are included, as leaf nodes all have the same Wasserstein distance and do not provide any further information.
The resulting observation-pairwise Wasserstein distances are saved to the DistanceKeeper (aka self.distances) and can be accessed by self.distances[f'{data_key}_{label}_wass_dist_observation_pairwise_profiles'].
- Parameters:
data_key (str) – The key to the data in the data keeper that should be used.
graph_key (str) – The key to the graph in the graph keeper that should be used. (The graph does not have to include all features in the data.)
features ({None, list [str]}) – List of features to compute profile distances on. If None, all features are used.
label (str) – Label under which the resulting Wasserstein distances are saved in keeper.distances; also used as the name of the file storing stacked results.
graph_distances (numpy.ndarray, (n, n)) – A matrix of node-pairwise graph distances between the \(n\) nodes (ordered from \(0, 1, ..., n-1\)). If None, hop distance is used.
edge_weight ({None, str}) – The edge attribute used as the weight for computing the graph distances. This is ignored if graph_distances is provided. If None, no edge weight is used.
measure_cutoff (float) – Threshold for treating values in profiles as zero, default = 1e-6.
proc (int) – Number of processors used for multiprocessing. (Default value = cpu_count().)
chunksize (int) – Chunksize to allocate for multiprocessing.
solvr (str) – Solver to pass to POT library for computing Wasserstein distance.
- Returns:
wds – Wasserstein distances between pairwise profiles, where rows are observation-pairs and columns are node names; saved in keeper.distances with the key f'{data_key}_{label}_profile_wass'.
- Return type:
pandas.DataFrame
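Example: for intuition, a minimal sketch of the underlying computation for one observation pair using the POT library, with hop distances as the ground metric. The graph and profiles are hypothetical; the actual method additionally handles normalization, the measure cutoff, and multiprocessing.

>>> import networkx as nx
>>> import numpy as np
>>> import ot
>>> G = nx.path_graph(3)  # nodes 0 - 1 - 2
>>> M = nx.floyd_warshall_numpy(G)  # node-pairwise hop distances
>>> a = np.array([0.8, 0.2, 0.0])  # observation 1's normalized feature profile
>>> b = np.array([0.0, 0.2, 0.8])  # observation 2's normalized feature profile
>>> w = ot.emd2(a, b, M)  # Wasserstein distance between the two profiles (= 1.6 here)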