netflow.methods.metrics
Functions
OTD | Compute the optimal transportation distance (OTD) of the given density distributions, trying first with the POT package and then with CVXPY.
norm_features | Norm of multi-feature data points in keeper.
norm_features_ | Norm of multi-feature data points.
norm_features_as_sym_dist | Construct symmetric distance matrix between observations from the norm of multi-feature pairwise-observation distances.
optimal_transportation_distance | Compute the optimal transportation distance (OTD) of the given density distributions using CVXPY.
pairwise_observation_euc_distances | Compute observation-pairwise Euclidean distances between the profiles.
pairwise_observation_wass_distances | Compute observation-pairwise Wasserstein distances between the profiles on a fixed weighted graph.
wass_distance | Compute Wasserstein distance between profiles of two observations.
- netflow.methods.metrics.OTD(x, y, d, solvr=None, flag=None)
Compute the optimal transportation distance (OTD) of the given density distributions, trying first with the POT package and then with CVXPY.
- Parameters:
x ((m,) numpy.ndarray) – Source’s density distribution, which includes the source and the source’s neighbors.
y ((n,) numpy.ndarray) – Target’s density distribution, which includes the target and the target’s neighbors.
d ((m, n) numpy.ndarray) – Shortest path matrix.
flag (str, optional) – Optional flag included in logger output to identify inputs.
- Returns:
m – Optimal transportation distance.
- Return type:
float
- netflow.methods.metrics.norm_features(keeper, key, features=None, method='L1', label=None)
Norm of multi-feature data points in keeper.
Intended to compute the norm of pairwise distances between observations \(s_q\) and \(s_r\) for all features \(f_i \in F\): \(D_{qr}^{(F)} = [d_{qr}^{(f_1)}, ..., d_{qr}^{(f_m)}]\)
The multi-feature distances array is of the form \(D^{(F)} = [d_{ij}]\), of size (n_observations * (n_observations - 1) / 2, n_features), with a multi-index of size 2 of the form \((obs_i, obs_j)\). Here \(F\) is the set of features with \(|F| =\) n_features, and \(d_{ij}\) is the distance between pairwise-observations \(i = (obs_p, obs_q)\) with respect to feature \(j\).
If label is provided, the resulting norm is stored in keeper.misc[label]. Otherwise, if label is None, the resulting norm is returned.
- Parameters:
keeper (netflow.Keeper) – The keeper object that stores the multiple pairwise-observation distances.
key (str) – The label used to reference the multiple pairwise-observation distances stored in keeper.misc, of size (n_observations * (n_observations - 1) / 2, n_features). The norm is computed on the rows of keeper.misc[key].
features ({None, List [str]}) – Subset of features to include. If provided, restrict the norm to the columns corresponding to features in the specified list. If None, use all columns.
method ({'L1', 'L2', 'inf', 'mean', 'median'}) –
Indicate which norm to compute. For each row of the form \(x = [x_1, x_2, ..., x_n]\) :
Options:
’L1’ : \(\sum_{i=1}^n abs(x_i)\)
’L2’ : \(\sqrt{\sum_{i=1}^n (x_i)^2}\)
’inf’ : \(max_i abs(x_i)\)
’mean’ : Mean of \(x\)
’median’ : Median of \(x\)
label ({None, str}) – Label used to store the resulting norm in keeper.misc. If None, the resulting norm is returned instead of being stored in the keeper.
- Returns:
n – Norm of the row(s), of length n_observations * (n_observations - 1) / 2.
- Return type:
pandas.Series
- netflow.methods.metrics.norm_features_(X, method='L1')
Norm of multi-feature data points.
Intended to compute the norm of pairwise distances between observations \(s_q\) and \(s_r\) for all features \(f_i \in F\): \(D_{qr}^{(F)} = [d_{qr}^{(f_1)}, ..., d_{qr}^{(f_m)}]\)
The multi-feature distances array is of the form \(D^{(F)} = [d_{ij}]\), of size (n_observations * (n_observations - 1) / 2, n_features), with a multi-index of size 2 of the form \((obs_i, obs_j)\). Here \(F\) is the set of features with \(|F| =\) n_features, and \(d_{ij}\) is the distance between pairwise-observations \(i = (obs_p, obs_q)\) with respect to feature \(j\).
- Parameters:
X (pandas.DataFrame, (n_observations * (n_observations - 1) / 2, n_features)) – The matrix of multiple pairwise-observation distances. The norm is computed on the rows of X.
method ({'L1', 'L2', 'inf', 'mean', 'median'}) –
Indicate which norm to compute. For each row of the form \(x = [x_1, x_2, ..., x_n]\) :
Options:
’L1’ : \(\sum_{i=1}^n abs(x_i)\)
’L2’ : \(\sqrt{\sum_{i=1}^n (x_i)^2}\)
’inf’ : \(max_i abs(x_i)\)
’mean’ : Mean of \(x\)
’median’ : Median of \(x\)
- Returns:
n – Norm of the row(s). When not a float: if X is a pandas.DataFrame, n is a pandas.Series; otherwise, if X is a numpy.ndarray, n is a numpy.ndarray vector.
- Return type:
float or array-like
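The row-wise norms listed above can be sketched with plain pandas. This is only an illustration, not netflow's implementation: the helper name `row_norm` and the toy data are hypothetical, and taking 'mean' and 'median' over the raw (not absolute) values is an assumption based on the option list.

```python
import numpy as np
import pandas as pd


def row_norm(X, method="L1"):
    """Hypothetical stand-in for the row-wise norms described above."""
    if method == "L1":
        return X.abs().sum(axis=1)
    if method == "L2":
        return np.sqrt((X ** 2).sum(axis=1))
    if method == "inf":
        return X.abs().max(axis=1)
    if method == "mean":
        return X.mean(axis=1)
    if method == "median":
        return X.median(axis=1)
    raise ValueError(f"Unrecognized method: {method!r}")


# Toy multi-feature distances: 3 observations give 3 pairs (rows), 2 features (columns).
pairs = pd.MultiIndex.from_tuples(
    [("s1", "s2"), ("s1", "s3"), ("s2", "s3")], names=["obs_i", "obs_j"]
)
X = pd.DataFrame({"f1": [3.0, 1.0, 0.0], "f2": [4.0, 2.0, 6.0]}, index=pairs)

l1 = row_norm(X, "L1")     # one value per observation pair
l2 = row_norm(X, "L2")
linf = row_norm(X, "inf")
```

Each result is a pandas.Series indexed by the same observation pairs as the rows of `X`.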
- netflow.methods.metrics.norm_features_as_sym_dist(keeper, key, label, features=None, method='L1', is_distance=True)
Construct symmetric distance matrix between observations from the norm of multi-feature pairwise-observation distances.
Intended to compute the norm of pairwise distances between observations \(s_q\) and \(s_r\) for all features \(f_i \in F\): \(D_{qr}^{(F)} = [d_{qr}^{(f_1)}, ..., d_{qr}^{(f_m)}]\)
The multi-feature distances array is of the form \(D^{(F)} = [d_{ij}]\), of size (n_observations * (n_observations - 1) / 2, n_features), with a multi-index of size 2 of the form \((obs_i, obs_j)\). Here \(F\) is the set of features with \(|F| =\) n_features, and \(d_{ij}\) is the distance between pairwise-observations \(i = (obs_p, obs_q)\) with respect to feature \(j\).
The multi-feature pairwise-observation values are treated as distances by default, but they could be similarities. If they are similarities, set is_distance to False. Note: similarities will have 1. along the diagonal.
- Parameters:
keeper (netflow.Keeper) – The keeper object that stores the multiple pairwise-observation distances.
key (str) – The label used to reference the multiple pairwise-observation distances stored in keeper.misc, of size (n_observations * (n_observations - 1) / 2, n_features). The norm is computed on the rows of keeper.misc[key].
label (str) – Label used to store the resulting distance matrix. If is_distance is True, the matrix is stored in keeper.distances[label]. Otherwise, it is a similarity matrix stored in keeper.similarity[label].
features ({None, List [str]}) – Subset of features to include. If provided, restrict the norm to the columns corresponding to features in the specified list. If None, use all columns.
method ({'L1', 'L2', 'inf', 'mean', 'median'}) –
Indicate which norm to compute. For each row of the form \(x = [x_1, x_2, ..., x_n]\) :
Options:
’L1’ : \(\sum_{i=1}^n abs(x_i)\)
’L2’ : \(\sqrt{\sum_{i=1}^n (x_i)^2}\)
’inf’ : \(max_i abs(x_i)\)
’mean’ : Mean of \(x\)
’median’ : Median of \(x\)
is_distance (bool) – Indicate if the multi-feature pairwise-observations are distances or similarities. If True, they are treated as distances. If False, they are treated as similarities.
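As a rough illustration of the step from a per-pair norm to a symmetric observation-by-observation matrix, the sketch below expands a pair-indexed Series into a square DataFrame. The helper `condensed_to_symmetric` is hypothetical and does not use a netflow Keeper; the diagonal values (0 for distances, 1 for similarities) follow the note above.

```python
import numpy as np
import pandas as pd


def condensed_to_symmetric(n, is_distance=True):
    """Hypothetical helper: expand pairwise values into a symmetric matrix."""
    # Collect observation labels in order of first appearance in the pair index.
    labels = list(dict.fromkeys(lbl for pair in n.index for lbl in pair))
    diag = 0.0 if is_distance else 1.0
    M = pd.DataFrame(np.eye(len(labels)) * diag, index=labels, columns=labels)
    # Fill both triangles so that M[p, q] == M[q, p].
    for (p, q), value in n.items():
        M.loc[p, q] = value
        M.loc[q, p] = value
    return M


pairs = pd.MultiIndex.from_tuples([("s1", "s2"), ("s1", "s3"), ("s2", "s3")])
n = pd.Series([7.0, 3.0, 6.0], index=pairs)

D = condensed_to_symmetric(n)                      # distances: zeros on the diagonal
S = condensed_to_symmetric(n, is_distance=False)   # similarities: ones on the diagonal
```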
- netflow.methods.metrics.optimal_transportation_distance(x, y, d, solvr=None)
Compute the optimal transportation distance (OTD) of the given density distributions using CVXPY.
- Parameters:
x (numpy.ndarray, (m,)) – Source’s density distribution, which includes the source and the source’s neighbors.
y (numpy.ndarray, (n,)) – Target’s density distribution, which includes the target and the target’s neighbors.
d (numpy.ndarray, (m, n)) – Shortest path matrix.
- Returns:
m – Optimal transportation distance.
- Return type:
float
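The quantity computed here is the cost of a linear program: find a non-negative transport plan \(P\) with row sums \(x\) and column sums \(y\) that minimizes \(\sum_{ij} d_{ij} P_{ij}\). The sketch below solves that program with SciPy's generic `linprog` solver instead of CVXPY or POT, purely to make the formulation concrete; the function name `otd_linprog` and the toy inputs are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def otd_linprog(x, y, d):
    """Illustrative optimal transport cost via a generic LP solver."""
    m, n = d.shape
    # Row-sum constraints: sum_j P[i, j] = x[i]  (P flattened row-major).
    A_rows = np.kron(np.eye(m), np.ones((1, n)))
    # Column-sum constraints: sum_i P[i, j] = y[j].
    A_cols = np.kron(np.ones((1, m)), np.eye(n))
    res = linprog(
        c=d.ravel(),
        A_eq=np.vstack([A_rows, A_cols]),
        b_eq=np.concatenate([x, y]),
        bounds=(0, None),
        method="highs",
    )
    if not res.success:
        raise RuntimeError(res.message)
    return float(res.fun)


# Toy example on a 3-node path graph, with shortest-path distances |i - j|.
nodes = np.arange(3)
d = np.abs(nodes[:, None] - nodes[None, :]).astype(float)
x = np.array([0.5, 0.5, 0.0])
y = np.array([0.0, 0.5, 0.5])

cost = otd_linprog(x, y, d)  # total mass-times-distance of the cheapest plan
```

Both inputs must carry the same total mass (here 1), otherwise the equality constraints are infeasible.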
- netflow.methods.metrics.pairwise_observation_euc_distances(profiles, metric='euclidean', **kwargs)
Compute observation-pairwise Euclidean distances between the profiles.
- Parameters:
profiles (pandas.DataFrame, (n_features, n_observations)) – Profiles between which the distance is computed.
metric (str or callable, optional) – The distance metric to use, passed to scipy.spatial.distance.cdist.
**kwargs (dict, optional) – Extra arguments to metric, passed to scipy.spatial.distance.cdist.
- Returns:
ed – Euclidean distances between pairwise observations.
- Return type:
pandas.Series
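A minimal sketch of the computation described above, assuming the result keeps each unordered pair of observations once and is indexed by the pair of column names. The variable names and toy data are hypothetical; only `scipy.spatial.distance.cdist` is taken from the description.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Toy profiles: 2 features (rows) x 3 observations (columns).
profiles = pd.DataFrame(
    {"s1": [0.0, 0.0], "s2": [3.0, 4.0], "s3": [6.0, 8.0]},
    index=["f1", "f2"],
)

# cdist expects one observation per row, so transpose first.
obs = profiles.T
full = cdist(obs.values, obs.values, metric="euclidean")

# Keep each unordered pair once (strict upper triangle of the full matrix).
i, j = np.triu_indices(len(obs), k=1)
index = pd.MultiIndex.from_arrays([obs.index[i], obs.index[j]])
ed = pd.Series(full[i, j], index=index, name="euclidean")
```

For n observations the Series has n * (n - 1) / 2 entries, matching the row count expected by the norm functions above.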
- netflow.methods.metrics.pairwise_observation_wass_distances(profiles, graph_distances, proc=4, pairwise_obs_list=None, chunksize=None, measure_cutoff=1e-06, solvr=None, flag=None)
Compute observation-pairwise Wasserstein distances between the profiles on a fixed weighted graph.
- Parameters:
profiles (pandas.DataFrame, (n_features, n_observations)) – Profiles that are normalized and treated as probability distributions for computing Wasserstein distance.
graph_distances (numpy.ndarray, (n, n)) – A matrix of node-pairwise graph distances between the n nodes (ordered from 0, 1, …, n-1).
measure_cutoff (float) – Threshold for treating values in profiles as zero, default = 1e-6.
proc (int) – Number of processors used for multiprocessing (default = 4, per the signature).
pairwise_obs_list (list [2-tuple]) – (Optional) Restricted list of pairwise observations between which the Wasserstein distance should be computed. Observations are expected to be columns of profiles. If not provided, the pairwise Wasserstein distance is computed between every two observations in profiles.
chunksize (int) – Chunksize to allocate for multiprocessing.
solvr (str) – Solver to pass to POT library for computing Wasserstein distance.
- Returns:
wd – Wasserstein distances between pairwise observations.
- Return type:
pandas.Series
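To illustrate what a pairwise_obs_list looks like, the sketch below enumerates every unordered pair of observations and then a restricted subset. The column names are hypothetical, and the order in which netflow enumerates pairs by default is an assumption.

```python
from itertools import combinations

# Hypothetical column names of `profiles`.
observations = ["s1", "s2", "s3", "s4"]

# Every unordered pair: n * (n - 1) / 2 tuples for n observations.
all_pairs = list(combinations(observations, 2))

# A restricted list, e.g. only the pairs that involve "s1".
restricted = [pair for pair in all_pairs if "s1" in pair]
```

A list shaped like `restricted` is the kind of value the pairwise_obs_list parameter describes.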
- netflow.methods.metrics.wass_distance(observations, profiles, graph_distances, measure_cutoff=1e-06, solvr=None, flag=None)
Compute Wasserstein distance between profiles of two observations.
- Parameters:
observations (2-tuple) – Profile column names referring to the two observations between which the Wasserstein distance should be computed.
profiles (pandas.DataFrame) – Observation profiles with features as rows and observations as columns.
graph_distances (numpy.ndarray, (n’, n’)) – A matrix of node-pairwise graph distances between the \(n'\) nodes, ordered by the rows in profiles.
measure_cutoff (float) – Threshold for treating values in profiles as zero, default = 1e-6.
- Returns:
wasserstein_distance – The Wasserstein distance.
- Return type:
float
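The sketch below shows one plausible way the inputs for a single pair could be prepared before a transport solver is called: threshold small values, normalize each profile to sum to 1, and restrict the distance matrix to the nodes that carry mass. How netflow applies measure_cutoff internally is not stated above, so the thresholding and the helper name `prepare_pair` are assumptions.

```python
import numpy as np
import pandas as pd


def prepare_pair(observations, profiles, graph_distances, measure_cutoff=1e-6):
    """Hypothetical preprocessing for one observation pair."""
    a, b = observations
    x = profiles[a].to_numpy(dtype=float)
    y = profiles[b].to_numpy(dtype=float)
    # Assumption: entries below the cutoff are treated as exactly zero.
    x = np.where(x < measure_cutoff, 0.0, x)
    y = np.where(y < measure_cutoff, 0.0, y)
    # Normalize so each profile sums to 1 (a probability distribution).
    x, y = x / x.sum(), y / y.sum()
    # Keep only nodes carrying mass, and the matching block of distances.
    sx, sy = np.flatnonzero(x), np.flatnonzero(y)
    return x[sx], y[sy], graph_distances[np.ix_(sx, sy)]


# Toy data: 3 features (graph nodes) x 2 observations, on a 3-node path graph.
profiles = pd.DataFrame({"s1": [2.0, 2.0, 0.0], "s2": [0.0, 1e-9, 4.0]})
nodes = np.arange(3)
graph_distances = np.abs(nodes[:, None] - nodes[None, :]).astype(float)

x, y, d = prepare_pair(("s1", "s2"), profiles, graph_distances)

# With a single target node, all mass must move there, so the cost is direct.
cost = float(x @ d[:, 0])
```

In the general case, with several target nodes, the reduced `x`, `y`, and `d` would be handed to an optimal transport solver rather than summed directly.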