netflow.methods.metrics

Functions

OTD(x, y, d[, solvr, flag])

Compute the optimal transportation distance (OTD) of the given density distributions, trying first with the POT package and then with CVXPY.

norm_features(keeper, key[, features, ...])

Norm of multi-feature data points in keeper.

norm_features_(X[, method])

Norm of multi-feature data points.

norm_features_as_sym_dist(keeper, key, label)

Construct symmetric distance matrix between observations from the norm of multi-feature pairwise-observation distances.

optimal_transportation_distance(x, y, d[, solvr])

Compute the optimal transportation distance (OTD) of the given density distributions by CVXPY.

pairwise_observation_euc_distances(profiles)

Compute observation-pairwise Euclidean distances between the profiles.

pairwise_observation_wass_distances(...[, ...])

Compute observation-pairwise Wasserstein distances between the profiles on a fixed weighted graph.

wass_distance(observations, profiles, ...[, ...])

Compute Wasserstein distance between profiles of two observations.

netflow.methods.metrics.OTD(x, y, d, solvr=None, flag=None)

Compute the optimal transportation distance (OTD) of the given density distributions, trying first with the POT package and then with CVXPY.

Parameters:
  • x ((m,) numpy.ndarray) – Source’s density distributions, includes source and source’s neighbors.

  • y ((n,) numpy.ndarray) – Target’s density distributions, includes target and target’s neighbors.

  • d ((m, n) numpy.ndarray) – Shortest path matrix.

  • flag (str, optional) – Optional flag included in logger output to identify inputs.

Returns:

m – Optimal transportation distance.

Return type:

float
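
For orientation, the quantity described above is the minimum of \(\sum_{ij} T_{ij} d_{ij}\) over non-negative transport plans \(T\) whose row sums equal x and whose column sums equal y. The sketch below solves that linear program with scipy.optimize.linprog. This is an illustrative stand-in, not the POT or CVXPY code path that OTD uses, and the helper name transport_distance_lp is hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def transport_distance_lp(x, y, d):
    """Illustrative helper (not part of netflow): solve the transport LP."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=float)
    m, n = d.shape

    # The plan T is flattened row-major, so entry (i, j) sits at index i * n + j.
    # Row-sum constraints: sum_j T[i, j] == x[i]
    row_constraints = np.kron(np.eye(m), np.ones((1, n)))
    # Column-sum constraints: sum_i T[i, j] == y[j]
    col_constraints = np.kron(np.ones((1, m)), np.eye(n))

    result = linprog(
        c=d.ravel(),
        A_eq=np.vstack([row_constraints, col_constraints]),
        b_eq=np.concatenate([x, y]),
        bounds=(0, None),
        method="highs",
    )
    if not result.success:
        raise RuntimeError(result.message)
    return float(result.fun)


# Three nodes on a path graph, with hop counts as shortest-path distances.
d = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])
x = np.array([1.0, 0.0, 0.0])  # all mass on node 0
y = np.array([0.0, 0.5, 0.5])  # mass split between nodes 1 and 2

distance = transport_distance_lp(x, y, d)
```

Here half of the mass travels one hop and half travels two hops, so the cost of the optimal plan is 0.5 * 1 + 0.5 * 2 = 1.5.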

netflow.methods.metrics.norm_features(keeper, key, features=None, method='L1', label=None)

Norm of multi-feature data points in keeper.

Intended to compute the norm of pairwise distances between observations \(s_q\) and \(s_r\) for all features \(f_i \in F\): \(D_{qr}^{(F)} = [d_{qr}^{(f_1)}, ..., d_{qr}^{(f_m)}]\)

The multi-feature distances array is of the form \(D^{(F)} = [d_{ij}]\) of size (n_observations * (n_observations - 1) / 2, n_features), with a multi-index of size 2 of the form \((obs_i, obs_j)\). \(F\) is the set of features, where \(|F|\) = n_features, and \(d_{ij}\) is the distance between pairwise-observations \(i = (obs_p, obs_q)\) with respect to feature \(j\).

If label is provided, the resulting norm is stored in keeper.misc[label]. Otherwise, if label is None, the resulting norm is returned.

Parameters:
  • keeper (netflow.Keeper) – The keeper object that stores the multiple pairwise-observation distances.

  • key (str) – The label used to reference the multiple pairwise-observation distances stored in keeper.misc, of size (n_observations * (n_observations - 1) / 2, n_features). The norm is computed on the rows of keeper.misc[key].

  • features ({None, List [str]}) – Subset of features to include. If provided, restrict to norm over columns corresponding to features in the specified list. If None, use all columns.

  • method ({'L1', 'L2', 'inf', 'mean', 'median'}) –

    Indicate which norm to compute. For each row of the form \(x = [x_1, x_2, ..., x_n]\) :

    Options:

    • ’L1’ : \(\sum_{i=1}^n abs(x_i)\)

    • ’L2’ : \(\sqrt{\sum_{i=1}^n (x_i)^2}\)

    • ’inf’ : \(max_i abs(x_i)\)

    • ’mean’ : Mean of \(x\)

    • ’median’ : Median of \(x\)

  • label ({None, str}) – Label used to store resulting norm in keeper.misc. If None, the resulting norm is returned instead of storing it in the keeper.

Returns:

n – Norm of the row(s), of length n_observations * (n_observations - 1) / 2.

Return type:

pandas.Series

netflow.methods.metrics.norm_features_(X, method='L1')

Norm of multi-feature data points.

Intended to compute the norm of pairwise distances between observations \(s_q\) and \(s_r\) for all features \(f_i \in F\): \(D_{qr}^{(F)} = [d_{qr}^{(f_1)}, ..., d_{qr}^{(f_m)}]\)

The multi-feature distances array is of the form \(D^{(F)} = [d_{ij}]\) of size (n_observations * (n_observations - 1) / 2, n_features), with a multi-index of size 2 of the form \((obs_i, obs_j)\). \(F\) is the set of features, where \(|F|\) = n_features, and \(d_{ij}\) is the distance between pairwise-observations \(i = (obs_p, obs_q)\) with respect to feature \(j\).

Parameters:
  • X (pandas.DataFrame, (n_observations * (n_observations - 1) / 2, n_features)) – The matrix of multiple pairwise-observation distances. The norm is computed on the rows of X.

  • method ({'L1', 'L2', 'inf', 'mean', 'median'}) –

    Indicate which norm to compute. For each row of the form \(x = [x_1, x_2, ..., x_n]\) :

    Options:

    • ’L1’ : \(\sum_{i=1}^n abs(x_i)\)

    • ’L2’ : \(\sqrt{\sum_{i=1}^n (x_i)^2}\)

    • ’inf’ : \(max_i abs(x_i)\)

    • ’mean’ : Mean of \(x\)

    • ’median’ : Median of \(x\)

Returns:

n – Norm of the row(s). When the result is not a float, n is a pandas.Series if X is a pandas.DataFrame, or a numpy.ndarray vector if X is a numpy.ndarray.

Return type:

float or array-like
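
As a minimal sketch of the method options listed above, the row-wise formulas can be written with NumPy as follows. The helper row_norm is hypothetical and only mirrors the documented formulas; it is not the netflow implementation and it returns a plain array rather than a pandas.Series.

```python
import numpy as np


def row_norm(X, method="L1"):
    """Illustrative helper: apply the documented formula to each row of X."""
    X = np.asarray(X, dtype=float)
    if method == "L1":
        return np.abs(X).sum(axis=1)
    if method == "L2":
        return np.sqrt((X ** 2).sum(axis=1))
    if method == "inf":
        return np.abs(X).max(axis=1)
    if method == "mean":
        return X.mean(axis=1)
    if method == "median":
        return np.median(X, axis=1)
    raise ValueError(f"Unrecognized method: {method!r}")


# Three observation pairs (rows) measured on two features (columns).
X = np.array([[3.0, 4.0],
              [1.0, -2.0],
              [0.0, 0.5]])

norms = {m: row_norm(X, m) for m in ("L1", "L2", "inf", "mean", "median")}
```

For the first row [3, 4], the 'L1' value is 7, the 'L2' value is 5, and the 'inf' value is 4. Note that 'mean' and 'median' do not take absolute values, so the second row [1, -2] gives -0.5 for both.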

netflow.methods.metrics.norm_features_as_sym_dist(keeper, key, label, features=None, method='L1', is_distance=True)

Construct symmetric distance matrix between observations from the norm of multi-feature pairwise-observation distances.

Intended to compute the norm of pairwise distances between observations \(s_q\) and \(s_r\) for all features \(f_i \in F\): \(D_{qr}^{(F)} = [d_{qr}^{(f_1)}, ..., d_{qr}^{(f_m)}]\)

The multi-feature distances array is of the form \(D^{(F)} = [d_{ij}]\) of size (n_observations * (n_observations - 1) / 2, n_features), with a multi-index of size 2 of the form \((obs_i, obs_j)\). \(F\) is the set of features, where \(|F|\) = n_features, and \(d_{ij}\) is the distance between pairwise-observations \(i = (obs_p, obs_q)\) with respect to feature \(j\).

The multi-feature pairwise-observations are by default treated as distances, but they could be similarities. If they are similarities, set is_distance to False. Note: similarities will have 1. along the diagonal.

Parameters:
  • keeper (netflow.Keeper) – The keeper object that stores the multiple pairwise-observation distances.

  • key (str) – The label used to reference the multiple pairwise-observation distances stored in keeper.misc, of size (n_observations * (n_observations - 1) / 2, n_features). The norm is computed on the rows of keeper.misc[key].

  • label (str) – Label used to store resulting distance matrix. If is_distance is True, the matrix is stored in keeper.distances[label]. Otherwise, it is a similarity matrix stored in keeper.similarity[label].

  • features ({None, List [str]}) – Subset of features to include. If provided, restrict to norm over columns corresponding to features in the specified list. If None, use all columns.

  • method ({'L1', 'L2', 'inf', 'mean', 'median'}) –

    Indicate which norm to compute. For each row of the form \(x = [x_1, x_2, ..., x_n]\) :

    Options:

    • ’L1’ : \(\sum_{i=1}^n abs(x_i)\)

    • ’L2’ : \(\sqrt{\sum_{i=1}^n (x_i)^2}\)

    • ’inf’ : \(max_i abs(x_i)\)

    • ’mean’ : Mean of \(x\)

    • ’median’ : Median of \(x\)

  • is_distance (bool) – Indicate if the multi-feature pairwise-observations are distances or similarities. If True, they are treated as distances. If False, they are treated as similarities.
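
To illustrate the reshaping step, the sketch below expands a vector of n_observations * (n_observations - 1) / 2 pairwise values into a symmetric square matrix, with 0 on the diagonal for distances and 1 on the diagonal for similarities. The helper condensed_to_square is hypothetical, and it assumes the pairs are ordered (0, 1), (0, 2), ..., (1, 2), ...; the ordering used inside netflow is determined by the multi-index and may differ.

```python
import numpy as np


def condensed_to_square(values, n_observations, is_distance=True):
    """Illustrative helper: fill a symmetric matrix from pairwise values."""
    values = np.asarray(values, dtype=float)
    expected = n_observations * (n_observations - 1) // 2
    if values.shape != (expected,):
        raise ValueError(f"Expected {expected} pairwise values.")

    square = np.zeros((n_observations, n_observations))
    # Assumed pair order: upper triangle, row by row.
    rows, cols = np.triu_indices(n_observations, k=1)
    square[rows, cols] = values
    square = square + square.T  # mirror into the lower triangle

    if not is_distance:
        # Similarities have 1 along the diagonal.
        np.fill_diagonal(square, 1.0)
    return square


pair_norms = [0.2, 0.7, 0.5]  # pairs (0, 1), (0, 2), (1, 2)
dist = condensed_to_square(pair_norms, 3)
sim = condensed_to_square(pair_norms, 3, is_distance=False)
```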

netflow.methods.metrics.optimal_transportation_distance(x, y, d, solvr=None)

Compute the optimal transportation distance (OTD) of the given density distributions by CVXPY.

Parameters:
  • x (numpy.ndarray, (m,)) – Source’s density distributions, includes source and source’s neighbors.

  • y (numpy.ndarray, (n,)) – Target’s density distributions, includes target and target’s neighbors.

  • d (numpy.ndarray, (m, n)) – Shortest path matrix.

Returns:

m – Optimal transportation distance.

Return type:

float

netflow.methods.metrics.pairwise_observation_euc_distances(profiles, metric='euclidean', **kwargs)

Compute observation-pairwise Euclidean distances between the profiles.

Parameters:
  • profiles (pandas.DataFrame, (n_features, n_observations)) – Profiles between which the Euclidean distance is computed.

  • metric (str or callable, optional) – The distance metric to use, passed to scipy.spatial.distance.cdist.

  • **kwargs (dict, optional) – Extra arguments to metric, passed to scipy.spatial.distance.cdist.

Returns:

ed – Euclidean distances between pairwise observations

Return type:

pandas.Series
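
As a small worked example of what "observation-pairwise" means here, the sketch below computes the Euclidean distance between every pair of columns of a (n_features, n_observations) array. It uses plain NumPy and a dict keyed by observation pair rather than the pandas.Series that the function returns, and the observation names are made up for illustration.

```python
import itertools

import numpy as np

# 2 features (rows) x 3 observations (columns).
profiles = np.array([[0.0, 3.0, 0.0],
                     [0.0, 4.0, 1.0]])
names = ["a", "b", "c"]  # hypothetical observation labels

# One entry per unordered pair of observations (columns).
pairwise = {
    (names[i], names[j]): float(np.linalg.norm(profiles[:, i] - profiles[:, j]))
    for i, j in itertools.combinations(range(len(names)), 2)
}
```

With these values, the pair ("a", "b") is at distance 5 and the pair ("a", "c") is at distance 1.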

netflow.methods.metrics.pairwise_observation_wass_distances(profiles, graph_distances, proc=4, pairwise_obs_list=None, chunksize=None, measure_cutoff=1e-06, solvr=None, flag=None)

Compute observation-pairwise Wasserstein distances between the profiles on a fixed weighted graph.

Parameters:
  • profiles (pandas.DataFrame, (n_features, n_observations)) – Profiles that are normalized and treated as probability distributions for computing Wasserstein distance.

  • graph_distances (numpy.ndarray, (n, n)) – A matrix of node-pairwise graph distances between the n nodes (ordered from 0, 1, …, n-1).

  • measure_cutoff (float) – Threshold for treating values in profiles as zero, default = 1e-6.

  • proc (int) – Number of processors used for multiprocessing. (Default value = cpu_count()).

  • pairwise_obs_list (list [2-tuple]) – (Optional) Provide restricted list of pairwise observations that the Wasserstein distance should be computed between. Observations are expected to be the columns of profiles. If not provided, the pairwise Wasserstein distance is computed between every two observations in profiles.

  • chunksize (int) – Chunksize to allocate for multiprocessing.

  • solvr (str) – Solver to pass to POT library for computing Wasserstein distance.

Returns:

wd – Wasserstein distances between pairwise observations

Return type:

pandas.Series
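
The parameters above describe profiles being normalized into probability distributions, with values below measure_cutoff treated as zero. One plausible reading of that preprocessing is sketched below. The helper to_distribution is hypothetical, and the order of operations (threshold first, then normalize) is an assumption rather than something this page states.

```python
import numpy as np


def to_distribution(profile, measure_cutoff=1e-6):
    """Illustrative helper: threshold small values, then normalize to sum 1."""
    p = np.asarray(profile, dtype=float).copy()
    # Assumption: entries below the cutoff are set to zero before normalizing.
    p[p < measure_cutoff] = 0.0
    total = p.sum()
    if total <= 0:
        raise ValueError("Profile has no mass above the cutoff.")
    return p / total


# One observation's profile over three graph nodes.
p = to_distribution([2.0, 1e-9, 6.0])
```

The resulting vector sums to 1 and can serve as the density distribution of one observation over the graph nodes.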

netflow.methods.metrics.wass_distance(observations, profiles, graph_distances, measure_cutoff=1e-06, solvr=None, flag=None)

Compute Wasserstein distance between profiles of two observations.

Parameters:
  • observations (2-tuple) – Profile column names referring to the two observations between which the Wasserstein distance is computed.

  • profiles (pandas.DataFrame) – Observation profiles with features as rows and observations as columns.

  • graph_distances (numpy.ndarray, (n’, n’)) – A matrix of node-pairwise graph distances between the \(n'\) nodes ordered by the rows in profiles.

  • measure_cutoff (float) – Threshold for treating values in profiles as zero, default = 1e-6.

Returns:

wasserstein_distance – The Wasserstein distance.

Return type:

float