Breast cancer example#

Pose pipeline overview#

(Single-modal) In this tutorial, we demonstrate how to perform the POSE pipeline on Breast Cancer (BC) RNA data from TCGA. This analysis entails the following:

Upload data
Compute pairwise sample (WE) distances with respect to gene neigborhoods
Compute global pairwise sample (WE) distance matrices
Convert global distances to a single sample pairwise similarity matrix
(Future work - dpt distance and/or multi-scale diffusion based distances)
Extract pseudo-organization (i.e., ordering) of samples.
Determine schema (i.e., branching).
Visualize schema
(Future work - further investigation into samples within different branches and differential analysis between branches)

$d_W(i,j,v)$ - The Wasserstein distance of the 1-hop neighborhood around gene $v$ between sample $i$ and sample $j$
$d_E(i,j,v)$ - The Euclidean distance of the 1-hop neighborhood around gene $v$ between sample $i$ and sample $j$
$D_W(i,j) = |d_W(i,j,v)|$ - Net Wasserstein distance between sample $i$ and sample $j$ wrt all genes $v$
$D_E(i,j) = |d_E(i,j,v)|$ - Net Euclidean distance between sample $i$ and sample $j$ wrt all genes $v$
$K_W = e^{-\frac{|D_W|^2}{\sigma^2}}$ - Pairwise sample Wasserstein similarity matrix
$K_E = e^{-\frac{|D_E|^2}{\sigma^2}}$ - Pairwise sample Euclidean similarity matrix
(Multi-feature, multi-modal) $K = \frac{K_W + K_E}{2}$ - Fused pairwise sample similarity matrix
$D = 1 - K$ - Fused pairwise sample distance matrix
(Clustering) Determine branching according to lineage tracing algorithm using $D$
(Visualizing) Pseudo-ordering of samples in branches according to distance from root node $r$, i.e., $D(r,:)$

Input

X : RNA-Seq data matrix
genelist : list of genes of interest in X
G : gene-gene (or protein-protein) interaction network

Algorithm

Compute sample-pairwise distances:
1. Compute sample-pairwise gene-1-hop-neighborhood distances:
  - Wasserstein distances (shape): For each gene g in the genelist, compute $$d_W^{(g)}(sample_i, sample_j)$$ which is the Wasserstein distance of gene g’s 1-hop neighborhood between every two samples i and j.
  - Euclidean distances (scale): For each gene g in the genelist, compute $$d_E^{(g)}(sample_i, sample_j)$$ which is the Euclidean distance of gene g’s 1-hop neighborhood (including self) between every two samples i and j.
2. Compute net sample-pairwise distances:
  - Wasserstein distance (shape): $$d_W(sample_i, sample_j) = |d_W^{(g)}(sample_i, sample_j)|$$ as the $L_1$ or $L_2$ norm of the vector of gene-1-hop-neighborhood Wasserstein distances for all genes g in the genelist.
  - Euclidean distance (scale): $$d_E(sample_i, sample_j) = |d_E^{(g)}(sample_i, sample_j)|$$ as the $L_1$ or $L_2$ norm of the vector of gene-1-hop-neighborhood Euclidean distances for all genes g in the genelist.
Convert net sample-pairwise distances to standardized similarities:
- $\sigma$ : Kernel width representing each data point’s accessible neighbors.
- Set $\sigma$ normalization for each obs as the distance to its k-th neighbor : $$K_{ij}(d) = \sqrt{\frac{2\sigma_i\sigma_j}{\sigma_i^2 + \sigma_j^2}}e^{-\frac{|d_{ij}^2|}{2(\sigma_i^2 + \sigma_j^2)}}$$ where for each observation $i$, $\sigma_i$ is the distance to its $k$-th nearest neighbor.
  - Standardized sample-pairwise Wasserstein similarity: $$K_W = K(d_W)$$
  - Standardized sample-pairwise Euclidean similarity: $$K_E = K(d_E)$$
Compute fused similarity $\tilde{K}$:
- $\tilde{K} = \frac{K_W + K_E}{2}$ (Note : could choose other weights to fuse with an unbalanced combination)
Compute diffusion distance from similarity: $d_K$
Compute POSE from diffusion distance: $G_{POSE}$
PROBE the POSE:
- Clustering
- Visualization

Aknowledgement

The diffusion distance and pseudo-ordering is performed according to the lineage tracing algorithm presented in Haghverdi, et al. (2019).
The code for computing the ordering is largely adopted from the scanpy implementation in Python.

First, import the necessary packages:

Load libraries#

import pathlib
import sys

from collections import defaultdict as ddict
import itertools
import matplotlib.colors as cm
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd
import scipy.sparse as sc_sparse
from tqdm import tqdm

If netflow has not been installed, add the path to the library:

sys.path.insert(0, pathlib.Path(pathlib.Path('.').absolute()).parents[3].resolve().as_posix())
# sys.path.insert(0, pathlib.Path(pathlib.Path('.').absolute()).parents[0].resolve().as_posix())

import netflow as nf
from netflow.probe import visualization as nfv

Set up directories#

MAIN_DIR = pathlib.Path('.').absolute()

Paths to where data is stored:

DATA_DIR = MAIN_DIR / 'example_data' / 'breast_tcga'

RNA_FNAME = DATA_DIR / 'rna_606.txt'
E_RNA_FNAME = DATA_DIR / 'edgelist_hprd_rna_606.txt'

CNA_FNAME = DATA_DIR / 'cna_606.txt'
E_CNA_FNAME = DATA_DIR / 'edgelist_hprd_cna_606.txt'

METH_FNAME = DATA_DIR / 'methylation_606.txt'
E_METH_FNAME = DATA_DIR / 'edgelist_hprd_methylation_606.txt'

CLIN_FNAME = DATA_DIR / 'clin_606.txt'

Directory where output should be saved:

OUT_DIR = MAIN_DIR / 'example_data' / 'results_netflow_breast_tcga'

Load clinical data#

clin = pd.read_csv(CLIN_FNAME, header=0, index_col=0)
print(clin.shape)

(606, 34)

clin.head(2)

	PATIENT_ID	SAMPLE_TYPE	CANCER_TYPE	CANCER_TYPE_DETAILED	ONCOTREE_CODE	TMB_NONSYNONYMOUS	OS_STATUS	OS_MONTHS	DFS_STATUS	DFS_MONTHS	...	PATH_MARGIN	AJCC_TUMOR_PATHOLOGIC_PT	AJCC_NODES_PATHOLOGIC_PN	AJCC_METASTASIS_PATHOLOGIC_PM	AJCC_PATHOLOGIC_TUMOR_STAGE	AJCC_STAGING_EDITION	Buffa Hypoxia Score	Ragnum Hypoxia Score	Winter Hypoxia Score	PAM50
TCGA-E9-A295-01	TCGA-E9-A295	Primary	Breast Cancer	Breast Invasive Lobular Carcinoma	ILC	1.866667	0:LIVING	12.32	0:DiseaseFree	12.32	...	Negative	T2	N0 (i-)	M0	Stage IIA	7th	-19.0	-4.0	0.0	LumA
TCGA-AR-A1AS-01	TCGA-AR-A1AS	Primary	Breast Cancer	Breast Invasive Ductal Carcinoma	IDC	0.933333	0:LIVING	37.78	0:DiseaseFree	37.78	...	Negative	T2	N1	M0	Stage IIB	6th	11.0	10.0	14.0	LumA

2 rows × 34 columns

Initialize the Keeper#

We use the same bulk-RNA TCGA breast cancer data (specifically, the RNA and methylation) as previously described in the Keeper tutorial.

# uncomment to initialize Keeper with no output directory:
# keeper = nf.Keeper() 

# initializing Keeper with output directory:
keeper = nf.Keeper(outdir=OUT_DIR)

Load the feature data (RNA and methylation) from file:

keeper.load_data(RNA_FNAME, label='rna', header=0, index_col=0) 
keeper.load_data(METH_FNAME, label='meth', header=0, index_col=0)

Load the associated graphs from their edgelist files:

keeper.load_graph(E_RNA_FNAME, label='rna')
keeper.load_graph(E_METH_FNAME, label='meth')

Compute observation-pairwise distances#

We can compute observation-pairwise distances with respect to the feature data. This can be performed on all features, but here we select a specific set of immune related features for demonstration.

# gene_sig = ['HLA-DRB1', 'CD80', 'IL11', 'TGFB2', 'KLRG1', 'HLA-DPB1', 'IL6', 'ITGB7', 
#             'SNAI1', 'CD3D', 'CCL5', 'ZAP70', 'GZMB', 'ARG1', 'CD247', 'TGFBR2', 'ICOS', 
#             'CD86', 'LAT', 'CD3G', 'VAV1', 'CD163', 'PTPN6', 'HLA-DQA2', 'IFNG', 'CD3E', 
#             'CTLA4', 'TUBA1A', 'ITK', 'ANGPTL4', 'CD8B', 'CD4', 'IL10', 'GZMA', 'ZEB1', 
#             'KLF10', 'CD40LG', 'HLA-DPA1', 'CD28', 'LCK', 'CCR5', 'IL17A', 'PDCD1', 
#             'HLA-DRB5', 'FOXP3', 'TGFBR1', 'IL4', 'VEGFA', 'CXCR3', 'LCP2', 'CSF3', 
#             'HLA-DQB1', 'ACTB', 'IL2', 'IL1B', 'TNF', 'TGFB1', 'IL1A', 'IL13', 'CD8A']

gene_sig_label = 'immune_signature'
gene_sig = ['ACTB', 'CCR5', 'CD163', 'CD247', 'CD28', 'CD3D', 'CD3E', 'CD3G', 
            'CD4', 'CD80', 'CD86', 'CD8A', 'CD8B', 'GZMA', 'GZMB', 'IFNG', 
            'IL10', 'IL1A', 'IL1B', 'IL6', 'ITGB7', 'ITK', 'KLF10', 'LAT', 
            'LCK', 'LCP2', 'PTPN6', 'SNAI1', 'TGFB2', 'TGFBR1', 'TGFBR2', 
            'TNF', 'TUBA1A', 'VAV1', 'VEGFA', 'ZAP70', 'ZEB1']
print(len(gene_sig))

We check that all features in gene_sig are also in the graphs:

for g in keeper.graphs:
    print(len(set(gene_sig) & set(g)))

37
37

Note: In general, the same features do not have to be used for each data set. In fact, the data sets may not even have any features in common. Here, we use the same selected set of features for simplicity of the demonstration.

Feature neighborhood distances#

For each feature, we can compute distances, such as Wasserstein and Euclidean, between feature neighborhood profiles of every two observations.

Next we compute the observation-pairwise Wasserstein distances between the 1-hop transition distribution from each feature in the selected gene signature gene_sig.

The argument include_self determines if the feature itself should be included in its neighborhood. When computing transition probabilities, the weight of the feature itself ends up being excluded via the mass-action principle. When interested in the transition probabilities, set include_self = False. Otherwise, if interested in the distance between normalized profiles over the neighborhood, set include_self = True.

Here, we are demonstrate the transition probability context and set include_self = False.

See the documentation for a description of all arguments.

keeper.outdir

PosixPath('/Users/renae12/Documents/MSKCC/dev/netflow/docs/source/tutorial/notebooks/example_data/results_netflow_breast_tcga')

for dd in keeper.data:
    keeper.wass_distance_pairwise_observation_feature_nbhd(dd.label, dd.label, features=gene_sig,
                                                           include_self=False, label=gene_sig_label)

Computing pairwise 1-hop nbhd Wasserstein distances: 100%|█████████████████████████████████████████| 36/36 [04:45<00:00,  7.93s/it]

netflow.methods.classes: 05/01/2025 05:05:01  PM | MSG |                                                           
classes:multiple_pairwise_observation_neighborhood_wass_distance:864 | >>> Observation pairwise 1-hop neighborhood 
Wasserstein distances on rna_immune_signature_nbhd_wass_without_self saved to                                      
rna_immune_signature_nbhd_wass_without_self.csv.

Computing pairwise 1-hop nbhd Wasserstein distances: 100%|█████████████████████████████████████████| 36/36 [04:52<00:00,  8.13s/it]

netflow.methods.classes: 05/01/2025 05:10:23  PM | MSG |                                                           
classes:multiple_pairwise_observation_neighborhood_wass_distance:864 | >>> Observation pairwise 1-hop neighborhood 
Wasserstein distances on meth_immune_signature_nbhd_wass_without_self saved to                                     
meth_immune_signature_nbhd_wass_without_self.csv.

As a result, for each feature data (RNA, CNA and methylation), the Wasserstein distances between pairwise observations, where rows are observation-pairs and columns are feature (node) names is saved to keeper.misc keyed by f"{data_key}_{label}_nbhd_euc_with{'' if include_self else 'out'}_self".

For example, we can see the RNA results as follows:

# keeper.misc[f"rna_{gene_sig_label}_wass_dist_observation_pairwise_nbhds_without_self"].head(2)

keeper.misc[f"rna_{gene_sig_label}_nbhd_wass_without_self"].head(2)

		ACTB	CCR5	CD163	CD247	CD28	CD3D	CD3E	CD3G	CD4	CD80	...	SNAI1	TGFB2	TGFBR1	TGFBR2	TNF	TUBA1A	VAV1	VEGFA	ZAP70	ZEB1
observation_a	observation_b
TCGA-E9-A295-01	TCGA-AR-A1AS-01	0.375430	0.482697	0.025785	0.423335	0.226186	0.623157	0.489348	0.161659	0.930970	0.061216	...	0.319277	0.760959	0.56571	0.955370	0.177350	0.167693	0.304148	0.199462	0.638770	0.645654
TCGA-E9-A295-01	TCGA-AQ-A1H2-01	0.467019	0.447891	0.050915	0.221294	0.223247	0.167546	0.245124	0.014759	0.153712	0.205665	...	0.178091	0.559237	0.62304	0.764808	0.481764	0.582859	0.383708	0.218584	0.474194	0.261223

2 rows × 36 columns

Similarly, we can compute the Euclidean distance on feature neighborhoods. As demonstrated next, the Euclidean metric is specified by the attribute metric. Any metric that can be passed to scipy.spatial.distance.cdist is acceptable.

for dd in keeper.data:
    keeper.euc_distance_pairwise_observation_feature_nbhd(dd.label, dd.label, features=gene_sig,
                                                           include_self=False, label=gene_sig_label,
                                                           metric='euclidean')

Computing pairwise 1-hop nbhd Euclidean distances: 100%|███████████████████████████████████████████| 36/36 [00:01<00:00, 25.96it/s]

netflow.methods.classes: 05/01/2025 05:23:41  PM | MSG |                                                           
classes:multiple_pairwise_observation_neighborhood_euc_distance:961 | >>> Observation pairwise 1-hop neighborhood  
Euclidean distances on rna_immune_signature_nbhd_euc_without_self saved to                                         
rna_immune_signature_nbhd_euc_without_self.csv.

Computing pairwise 1-hop nbhd Euclidean distances: 100%|███████████████████████████████████████████| 36/36 [00:01<00:00, 27.24it/s]

netflow.methods.classes: 05/01/2025 05:23:52  PM | MSG |                                                           
classes:multiple_pairwise_observation_neighborhood_euc_distance:961 | >>> Observation pairwise 1-hop neighborhood  
Euclidean distances on meth_immune_signature_nbhd_euc_without_self saved to                                        
meth_immune_signature_nbhd_euc_without_self.csv.

Distance from a single feature neighborhood#

If we are interested in the distances with respect to a single feature neighborhood, for example CD28, we can extract the stacked-distance format from keeper.misc and add it to the distances in keeper.distances:

First we extract the stacked RNA CD28 neighborhood Wasserstein distances:

keeper.misc.keys()

dict_keys(['rna_immune_signature_nbhd_wass_without_self', 'meth_immune_signature_nbhd_wass_without_self', 'rna_immune_signature_nbhd_euc_without_self', 'meth_immune_signature_nbhd_euc_without_self'])

rna_CD28_nbhd_wass_dist = keeper.misc[f"rna_{gene_sig_label}_nbhd_wass_without_self"]['CD28']
rna_CD28_nbhd_wass_dist.head()

observation_a    observation_b  
TCGA-E9-A295-01  TCGA-AR-A1AS-01    0.226186
                 TCGA-AQ-A1H2-01    0.223247
                 TCGA-A8-A08O-01    0.246211
                 TCGA-BH-A1FJ-01    0.322797
                 TCGA-JL-A3YX-01    0.382493
Name: CD28, dtype: float64

Then we add the stacked distances to keeper.distances, keyed by 'wass_dist_rna_CD28_nbhd':

keeper.add_stacked_distance(rna_CD28_nbhd_wass_dist, 'wass_dist_rna_CD28_nbhd')

Distance from norm of feature neighborhood distances#

Alternatively, we can use the norm of the vector of neighborhood distances over all features as the final distance. First, we extract the norm of feature neighborhood distances, again demonstrated on the RNA-Wasserstein distances.

Summary of the arguments passed to compute the norm and save it as a distance:

misc_key : Specify the key of the stacked RNA feature neighborhood Wasserstein distnaces stored in keeper.misc
distance_label : Specify the key that the resulting distance is stored as in keeper.distances
features : Option to select subset of features to include. If provided, restrict to norm over columns corresponding to features in the specified list. If None, use all columns. (In this demonstration, we use the default behavior and include all features that neighborhood distances were computed on.)
method : Indicate which norm to compute, can be one of [‘L1’, ‘L2’, ‘inf’, ‘mean’, ‘median’]

misc_key = f"rna_{gene_sig_label}_nbhd_wass_without_self"
distance_label = f"norm_{misc_key}"
nf.methods.metrics.norm_features_as_sym_dist(keeper, misc_key,label=distance_label,
                                             features=None, method='L1')

We can now see this distance has been added to keeper.distances:

keeper.distances[distance_label].to_frame().head(2)

	TCGA-E9-A295-01	TCGA-AR-A1AS-01	TCGA-AQ-A1H2-01	TCGA-A8-A08O-01	TCGA-BH-A1FJ-01	TCGA-JL-A3YX-01	TCGA-A7-A425-01	TCGA-AC-A2BM-01	TCGA-LL-A6FP-01	TCGA-A7-A26E-01	...	TCGA-A2-A1G0-01	TCGA-WT-AB41-01	TCGA-EW-A1P6-01	TCGA-XX-A89A-01	TCGA-A7-A4SD-01	TCGA-AC-A6IX-01	TCGA-AR-A24L-01	TCGA-BH-A42U-01	TCGA-AR-A24S-01	TCGA-BH-A0BC-01
TCGA-E9-A295-01	0.000000	13.415799	11.976610	10.434216	11.961795	9.241838	9.865547	12.639423	17.064882	13.054158	...	11.651279	16.169011	10.006264	11.019874	13.043276	9.243581	12.245928	13.691054	11.829914	10.739983
TCGA-AR-A1AS-01	13.415799	0.000000	13.968681	9.741795	9.344761	9.340045	12.072010	12.957682	16.617421	10.944821	...	13.210496	19.160851	9.812397	13.671998	15.426071	11.771620	11.390057	17.613263	13.028777	11.599280

2 rows × 606 columns

Feature profile distances#

We can also compute distances between feature profiles of every two observations.

The Wasserstein and Euclidean distances between the selected gene signature feature profiles is computed for each feature data set (RNA and methylation) as follows:

for dd in keeper.data:    
    # Wasserstein distances:
    keeper.wass_distance_pairwise_observation_profile(dd.label, dd.label, 
                                                      features=gene_sig, label=gene_sig_label,                                                      
                                                      graph_distances=None, edge_weight=None)

    # Euclidean distances:
    keeper.euc_distance_pairwise_observation_profile(dd.label, features=gene_sig, label=gene_sig_label)                                                                  

Compute similarities from distances#

n_neighbors = 5

for dd in keeper.distances:
    keeper.compute_similarity_from_distance(dd.label, n_neighbors, 'max',                                                    
                                            label=None, knn=False)

We can see the labels of all the similarities that have been added to the keeper (one for each distance that was in the keeper):

print(*keeper.similarities.keys(), sep='\n')

similarity_max5nn_wass_dist_rna_CD28_nbhd
similarity_max5nn_norm_rna_immune_signature_nbhd_wass_without_self
similarity_max5nn_rna_immune_signature_profile_wass
similarity_max5nn_rna_immune_signature_profile_euc
similarity_max5nn_meth_immune_signature_profile_wass
similarity_max5nn_meth_immune_signature_profile_euc

Fused similarities#

We can take any combination of similarities and add them as a fused similarity to the keeper.

For example, we can fuse the Wasserstein and Euclidean RNA profile distances.

This can be done in two ways :

Explicitly, by computing the fused distance and then adding it to the keeper or,
implicitly, using the keeper’s method fuse_similarities. In this approach, you provide:
- similarity_keys : the list of keys referencing the similarities that should be fused
- weights (optional) : weights for how much each similarity should contribute to the fusion. The default behavior is to assign uniform weights and compute the average.
- fused_key (optional) : the key used to store the fused similarity in the keeper. The default behavior is to fuse the keys of the original similarities.

We demonstrate both options, uncomment one to perform.

Uncomment below for explicitly computing the fused similarity.

# fused_similarity = 0.5*(keeper.similarities['similarity_max5nn_rna_immune_signature_profile_wass'].data + \
#                         keeper.similarities['similarity_max5nn_rna_immune_signature_profile_euc'].data)

# fused_key = 'fused_similarity_max5nn_rna_immune_signature_profile_wass_euc'
# keeper.add_similarity(fused_similarity, fused_key)

Uncomment below to implicitly compute the fused similarity.

sim_keys = ['similarity_max5nn_rna_immune_signature_profile_wass', 'similarity_max5nn_rna_immune_signature_profile_euc']
fused_key = 'fused_similarity_max5nn_rna_immune_signature_profile_wass_euc'

keeper.fuse_similarities(sim_keys, 
                         # weights=None,  # uncomment to add fused weighting
                         fused_key=fused_key)

Note: If fused_key is not provided (fused_key = None), default behavior is to attempt to merge the similarity labels.

We can see that the fused similarity has been added to the Keeper with the key 'fused_similarity_max5nn_rna_immune_signature_profile_wass_euc':

print(*keeper.similarities.keys(), sep='\n')

similarity_max5nn_wass_dist_rna_CD28_nbhd
similarity_max5nn_norm_rna_immune_signature_nbhd_wass_without_self
similarity_max5nn_rna_immune_signature_profile_wass
similarity_max5nn_rna_immune_signature_profile_euc
similarity_max5nn_meth_immune_signature_profile_wass
similarity_max5nn_meth_immune_signature_profile_euc
fused_similarity_max5nn_rna_immune_signature_profile_wass_euc

Or, we can fuse the RNA and methylation Wasserstein profile distances:

keeper.fuse_similarities(['similarity_max5nn_rna_immune_signature_profile_wass',
                          'similarity_max5nn_meth_immune_signature_profile_wass'])

We can check that it has been added to the similarities, with the default fused key fused_similarity_max5nn_rna_meth_immune_signature_profile_wass:

'fused_similarity_max5nn_rna_meth_immune_signature_profile_wass' in keeper.similarities

True

Since the similarities range from 0 to 1 (0 being the least similar and 1 being the most similar), we can transform them into distances by taking 1-similarity.

Next, we add the distances from each similarity to keeper.distances, so they can be used in later analysis:

for dd in keeper.similarities:
    keeper.convert_similarity_to_distance(dd.label)

Compute diffusion distances#

Next we show how to compute the scale-free diffusion distance from a similarity. This entails computing the transition matrix associated with the similarity which is stored in the misc keeper.

To compute the diffusion distance for each similarity, we simply loop through the similarities stored in the keeper. See the documentation for a description of additional options.

for sim in keeper.similarities:
    keeper.compute_dpt_from_similarity(sim.label, density_normalize=True)

# for sim in keeper.similarities:
#     keeper.compute_transitions_from_similarity(sim.label, density_normalize=True)
#     T_sym_label = f"transitions_sym_{sim.label}"
#     keeper.compute_dpt_from_augmented_sym_transitions(T_sym_label)

Compute the POSE#

The POSE can be computed from any distance in the keeper. We demonstrate this next using the multi-scale Wasserstein distance between RNA immune signature profiles, with the reference key 'dpt_from_transitions_sym_similarity_max5nn_rna_immune_signature_profile_wass_density_normalized' in the keeper.

The POSER class is used to perform the iterative branching and keep track the branches and their splits. The final POSE, i.e., the Pseudo-Organizational StructurE of the observations is returned as a graph where each observation is represented as a node. The edges show the organizational relationship determined from the ordering (lineage tracing algorithm) and nearest neighbors (alleviates strict assumptions of lineage tracing that do not apply).

print(*[dd.label for dd in keeper.distances], sep='\n')

wass_dist_rna_CD28_nbhd
norm_rna_immune_signature_nbhd_wass_without_self
rna_immune_signature_profile_wass
rna_immune_signature_profile_euc
meth_immune_signature_profile_wass
meth_immune_signature_profile_euc
distance_from_similarity_max5nn_wass_dist_rna_CD28_nbhd
distance_from_similarity_max5nn_norm_rna_immune_signature_nbhd_wass_without_self
distance_from_similarity_max5nn_rna_immune_signature_profile_wass
distance_from_similarity_max5nn_rna_immune_signature_profile_euc
distance_from_similarity_max5nn_meth_immune_signature_profile_wass
distance_from_similarity_max5nn_meth_immune_signature_profile_euc
distance_from_fused_similarity_max5nn_rna_immune_signature_profile_wass_euc
distance_from_fused_similarity_max5nn_rna_meth_immune_signature_profile_wass
dpt_from_transitions_sym_similarity_max5nn_wass_dist_rna_CD28_nbhd_density_normalized
dpt_from_transitions_sym_similarity_max5nn_norm_rna_immune_signature_nbhd_wass_without_self_density_normalized
dpt_from_transitions_sym_similarity_max5nn_rna_immune_signature_profile_wass_density_normalized
dpt_from_transitions_sym_similarity_max5nn_rna_immune_signature_profile_euc_density_normalized
dpt_from_transitions_sym_similarity_max5nn_meth_immune_signature_profile_wass_density_normalized
dpt_from_transitions_sym_similarity_max5nn_meth_immune_signature_profile_euc_density_normalized
dpt_from_transitions_sym_fused_similarity_max5nn_rna_immune_signature_profile_wass_euc_density_normalized
dpt_from_transitions_sym_fused_similarity_max5nn_rna_meth_immune_signature_profile_wass_density_normalized

# key = 'distance_from_fused_similarity_max5nn_rna_meth_immune_signature_profile_wass'
# key = 'dpt_from_transitions_sym_similarity_max5nn_norm_rna_immune_signature_nbhd_wass_without_self_density_normalized'
# key = 'dpt_from_transitions_sym_similarity_max5nn_meth_immune_signature_profile_wass_density_normalized'
# key = 'dpt_from_transitions_sym_similarity_max5nn_rna_immune_signature_profile_euc_density_normalized'
# key = 'dpt_from_transitions_sym_similarity_max5nn_meth_immune_signature_profile_euc_density_normalized'
# key = 'dpt_from_transitions_sym_similarity_max5nn_rna_immune_signature_profile_wass_density_normalized' # ***
# key = 'dpt_from_transitions_sym_fused_similarity_max5nn_rna_immune_signature_profile_wass_euc_density_normalized'
key = 'dpt_from_transitions_sym_similarity_max5nn_wass_dist_rna_CD28_nbhd_density_normalized'

poser, G_poser_nn = keeper.construct_pose(key, n_branches=5, min_branch_size=10,
                                          until_branched=True, verbose='ERROR')

We briefly go over some of the main arguments for consideration when computing the pose:

root : You can specify which observation should be used as the root or the method to select the root. The default is to set the root as the central observation, determined as the observation with the smallest average distance to all other observations.
min_branch_size : The minimum number of observations a branch must have to be considered for branching during the recursive branching procedure.
split : If True, a branch is split according to its three tips. Otherwise, split the branch by determining a single branching off the main segment with respect to the tip farthest from the root.
n_branches : The number of times to iteratively perform the branching, unless the process terminates earlier.
until_branched : This is intended for compatibility with the scanpy implementation, where branching is iteratively attempted n_branches times. However, some iterations may not result in any further branching (e.g., if no split is found). We add the option (by setting until_branched = True) so that such iterations that do not result in further branching do not count toward the number of iterations performed. Thus, when until_branched = True, the iteration does not terminate until either a branch has been split or the process terminates.

See the documentation for a description of all possible arguments.

A closer look at the POSER#

First, we can get the index of the observation that was used as the root:

poser.root

We can also get the label of the root observation:

keeper.observation_labels[poser.root]

'TCGA-OL-A5RW-01'

We can get an overview of the branching:

poser.tree.root.disp()

            ┌1 (n = 58)
            ├2 (n = 60)
            │                   ┌0 - trunk (n = 13)
            │                   ├4 (n = 2)
            ├0 - trunk (n = 113)┤
            │                   ├6 (n = 97)
            │                   └5 (n = 1)
 0 (n = 606)┤
            │           ┌3 - trunk (n = 20)
            │           ├7 (n = 17)
            │           ├9 (n = 4)
            └3 (n = 375)┤
                        │           ┌8 - trunk (n = 53)
                        │           ├10 (n = 14)
                        │           ├11 (n = 248)
                        └8 (n = 334)┤
                                    │           ┌12 - trunk (n = 11)
                                    │           ├13 (n = 1)
                                    └12 (n = 19)┤
                                                ├15 (n = 2)
                                                └14 (n = 5)

We can get a list of the branches that were split in the order in which they were split:

poser.branched_ordering

[0 (n = 606), 0 - trunk (n = 113), 3 (n = 375), 8 (n = 334), 12 (n = 19)]

Visualizing the POSE#

The topology of the POSE is return as G_poser_nn.

We highlight some of the node and edge attributes

Node attributes:

'branch' : Reference ID of the branch the observation belongs to.
'name' : Label of the observation the node corresponds to.

Edge attributes:

'connection’ : Takes one of the following values:
- 'intra-branch' if the edge connects two observations in the same branch and
- 'inter-branch' if the edge connects two observations in different branches.
'distance' :
'edge_origin' : Takes one of the following values:
- 'POSE + NN'
- 'POSE'
- 'NN'

We can get a record of the branch that each observation is assigned to:

branch_record = pd.Series(dict(G_poser_nn.nodes.data(data='branch', 
                                                     default=-1))).rename(index=dict(G_poser_nn.nodes.data(data='name')))
branch_record.name = key

And then we can see the number of observations in each branch:

branch_record.value_counts()

  248
    97
    60
    58
    53
    20
    17
   14
    13
   11
    5
     4
     2
    2
     1
    1
Name: dpt_from_transitions_sym_similarity_max5nn_wass_dist_rna_CD28_nbhd_density_normalized, dtype: int64

Kaplan-Meier survival analysis#

Next, we perform Kaplan-Meier survival analysis grouping the observations by their branch.

Note: We point out that the branches provide an ordering (and not a partition). While branching captures breaks in relational structure of the observations, it does not necessarily partition the observations into hetergeneous clusters where the clusters are made up of homogeneous observations. Further clustering is recommended

T = clin['OS_MONTHS'].copy()
print(T.max())

E = clin['OS_STATUS'].copy()
E = E.replace({'0:LIVING': 0, '1:DECEASED': 1})
display(E.value_counts())

TE = pd.concat([T, E], axis=1)

282.69

0    530
1     76
Name: OS_STATUS, dtype: int64

cut = 12.*10.
Tcut = T.copy()
Tcut[Tcut > cut] = cut
print(Tcut.max())

Ecut = E.copy()
Ecut.loc[(T > cut) & (E==1)] = 0
display(Ecut.value_counts())

TEcut = pd.concat([Tcut, Ecut], axis=1)

120.0

0    534
1     72
Name: OS_STATUS, dtype: int64

colors = [cm.to_hex(plt.cm.tab10(i)) for i in range(10)] if branch_record.nunique() <=10 else [cm.to_hex(plt.cm.tab20(i)) for i in range(20)]

if branch_record.nunique() <= 20:
    color_cycler = colors
else:
    norm = matplotlib.colors.Normalize(vmin=min(branch_record.unique()), 
                                       vmax=max(branch_record.unique()), clip=True)
    mapper = matplotlib.cm.ScalarMappable(norm=norm, cmap=matplotlib.cm.gist_rainbow)
    color_cycler = [mapper.to_rgba(v) for v in sorted(branch_record.unique())]

nfv.KM_between_groups(TEcut['OS_MONTHS'], TEcut['OS_STATUS'], 
                      branch_record, min_group_size=3,
                      figsize=(7, 4), show_censors=True, ci_show=False,
                      show_at_risk_counts=False,
                      ttl=f"POSE branching\n{key}",                      
                      colors=dict(zip(sorted(branch_record.unique()), itertools.cycle(color_cycler))), 
                      xlabel='time (months)', linewidth=2,
                      censor_styles={'ms': 5, 'marker': '|', # 'markeredgecolor': '#ff0000',
                                    },
                      )

../../_images/9a72a8318fe052d984c238659f1dce0f60b809a5a124e45c51fdbcc25bffaff1.png

Visualize POSE#

pos = nx.layout.kamada_kawai_layout(G_poser_nn)

First determine features to color nodes by:

root_obs = poser.root
nodelist = list(G_poser_nn)
edgelist = list(G_poser_nn.edges())
node_labels = {k: G_poser_nn.nodes[k]['name'] for k in nodelist}
edge_style = [G_poser_nn.edges[ee]['edge_origin'] for ee in edgelist]
node_size = [40 if k== root_obs else 30 for k in nodelist]
# node_shape = clin['Dead'].replace({False: 'alive', True: 'deceased'}).loc[[node_labels[kk] for kk in nodelist]].values.tolist()
# node_shape_response = clin['Response'].loc[[node_labels[kk] for kk in nodelist]].values.tolist()
node_shape_pam50 = clin['PAM50'].loc[[node_labels[kk] for kk in nodelist]].values.tolist()
node_shape_os = TEcut['OS_STATUS'].replace({1: 'Deceased', 
                                            0: 'Alive'}).loc[[node_labels[kk] for kk in nodelist]].values.tolist()

nsm_pam50 = {'Normal': '^', 'LumA': 'd', 'Basal': 'o', 'LumB': 'p', 'Her2': 's'}
nsm_os = {'Alive': 's', 'Deceased': 'o'}

node_colors_record = {}
node_colors_record[f'branch'] = [G_poser_nn.nodes[k]['branch'] for k in nodelist]
node_colors_record[f'pseudo-distance to root'] = [poser.distances[poser.root, k] for k in nodelist] 
node_colors_record['OS'] = TEcut['OS_MONTHS'].loc[[node_labels[kk] for kk in nodelist]].values.tolist() 
node_colors_record[f"mean {gene_sig_label} rna signature"] = keeper.data['rna'].subset(features=list(set(gene_sig) & set(keeper.graphs['rna']))).mean(axis=0).values.tolist()
# node_colors_record['CD28 rna'] = keeper.data['rna'].subset(features=['CD28']).loc['CD28'].loc[[node_labels[kk] for kk in nodelist]].values.tolist()

for gn in ['CD28', 'ITK', 'LCK', 'CD4', 'CD247','CD80','CD86', 'ITGB7']: # ['CD28', 'CD4', 'CCR5', 'CD3G', 'LCP2', 'LCK', 'ITK', 'ITGB7']: # ['CD28', 'GZMB', 'TGFBR2', 'TNF']:
    node_colors_record[gn] = keeper.data['rna'].subset(features=[gn]).loc[gn].loc[[node_labels[kk] for kk in nodelist]].values.tolist()

fig, axes = plt.subplots(3, 4, figsize=(20, 20), # gridspec_kw={'hspace':0.001, 'wspace': 0.02},
                         layout='constrained')
fig.dpi = 300
fontsize = 11 # 42
fig.suptitle(f"POSE: {key}\nnodes colored by:", fontsize=fontsize);

for (ttl, nc), cmap, node_shape, nsm, ncde, ntm, ax in zip(node_colors_record.items(), 
                                                           ['tab10' if branch_record.nunique() <= 10 else 'tab20' if branch_record.nunique() <=20 else 'gist_rainbow'] + ['nipy_spectral']*(len(node_colors_record)-1),
                                                           [node_shape_pam50]*2 + [node_shape_os] + [node_shape_pam50]*(len(node_colors_record)-3),
                                                           [nsm_pam50]*2 + [nsm_os] + [nsm_pam50]*(len(node_colors_record)-3),
                                                           [True] + [None]*(len(node_colors_record)-1),
                                                           [None]*len(node_colors_record), #  + [{0: 'NIV3-NAIVE', 1: 'NIV3-PROG'}] + [None]*3,
                                                           np.ravel(axes)):
    ax = nfv.plot_topology(G_poser_nn, pos=pos, ax=ax,
                           with_node_labels=False, node_font_size=2, with_edge_labels=False,
                           node_cbar=True,
                           nodelist=nodelist, edgelist=edgelist, node_cmap=cmap,
                           node_size=node_size, node_color=nc, edge_color='lightgray',                   
                           node_shape=node_shape, node_shape_mapper=nsm, 
                           border_linewidths=0., # 0.5, 
                           bordercolors='k', edge_width=1.0, edge_style=edge_style, 
                           edge_style_mapper={'POSE + NN': '-', 'POSE': '--', 'NN': ':'},
                           node_labels=node_labels, edge_labels=None,
                           node_cbar_label=ttl,
                           node_cmap_drawedges=ncde,
                           node_ticklabels_mapper=ntm,
                          )        

../../_images/0c572002a60431b114d23d25c31fb7657424e37e2100c768260354e3a9de8eb3.png

root_obs = poser.root
nodelist = list(G_poser_nn)
edgelist = list(G_poser_nn.edges())
node_labels = {k: G_poser_nn.nodes[k]['name'] for k in nodelist}
edge_style = [G_poser_nn.edges[ee]['edge_origin'] for ee in edgelist]
node_size = [40 if k== root_obs else 30 for k in nodelist]
# node_shape = clin['Dead'].replace({False: 'alive', True: 'deceased'}).loc[[node_labels[kk] for kk in nodelist]].values.tolist()
# node_shape_response = clin['Response'].loc[[node_labels[kk] for kk in nodelist]].values.tolist()
node_shape_pam50 = clin['PAM50'].loc[[node_labels[kk] for kk in nodelist]].values.tolist()
node_shape_os = TEcut['OS_STATUS'].replace({1: 'Deceased', 
                                            0: 'Alive'}).loc[[node_labels[kk] for kk in nodelist]].values.tolist()

nsm_pam50 = {'Normal': '^', 'LumA': 'd', 'Basal': 'o', 'LumB': 'p', 'Her2': 's'}
nsm_os = {'Alive': 's', 'Deceased': 'o'}

node_colors_record = {}
node_colors_record[f'branch'] = [G_poser_nn.nodes[k]['branch'] for k in nodelist]
node_colors_record[f'pseudo-distance to root'] = [poser.distances[poser.root, k] for k in nodelist] 
node_colors_record['OS'] = TEcut['OS_MONTHS'].loc[[node_labels[kk] for kk in nodelist]].values.tolist() 
node_colors_record[f"mean {gene_sig_label} rna signature"] = keeper.data['rna'].subset(features=list(set(gene_sig) & set(keeper.graphs['rna']))).mean(axis=0).values.tolist()
# node_colors_record['CD28 rna'] = keeper.data['rna'].subset(features=['CD28']).loc['CD28'].loc[[node_labels[kk] for kk in nodelist]].values.tolist()

for gn in ['CD28', 'CD4', 'CCR5', 'CD3G', 'LCP2', 'LCK', 'ITK', 'ITGB7']: # ['CD28', 'GZMB', 'TGFBR2', 'TNF']: 
    # ['CD28', 'ITK', 'LCK', 'CD4', 'CD247','CD80','CD86']
    
    node_colors_record[gn] = keeper.data['rna'].subset(features=[gn]).loc[gn].loc[[node_labels[kk] for kk in nodelist]].values.tolist()

fig, axes = plt.subplots(3, 4, figsize=(20, 20), # gridspec_kw={'hspace':0.001, 'wspace': 0.02},
                         layout='constrained')
fig.dpi = 300
fontsize = 11 # 42
fig.suptitle(f"POSE: {key}\nnodes colored by:", fontsize=fontsize);

for (ttl, nc), cmap, node_shape, nsm, ncde, ntm, ax in zip(node_colors_record.items(), 
                                                           ['tab10' if branch_record.nunique() <= 10 else 'tab20' if branch_record.nunique() <=20 else 'gist_rainbow'] + ['nipy_spectral']*(len(node_colors_record)-1),
                                                           [node_shape_pam50]*2 + [node_shape_os] + [node_shape_pam50]*(len(node_colors_record)-3),
                                                           [nsm_pam50]*2 + [nsm_os] + [nsm_pam50]*(len(node_colors_record)-3),
                                                           [True] + [None]*(len(node_colors_record)-1),
                                                           [None]*len(node_colors_record), #  + [{0: 'NIV3-NAIVE', 1: 'NIV3-PROG'}] + [None]*3,
                                                           np.ravel(axes)):
    ax = nfv.plot_topology(G_poser_nn, pos=pos, ax=ax,
                           with_node_labels=False, node_font_size=2, with_edge_labels=False,
                           node_cbar=True,
                           nodelist=nodelist, edgelist=edgelist, node_cmap=cmap,
                           node_size=node_size, node_color=nc, edge_color='lightgray',                   
                           node_shape=node_shape, node_shape_mapper=nsm, 
                           border_linewidths=0., # 0.5, 
                           bordercolors='k', edge_width=1.0, edge_style=edge_style, 
                           edge_style_mapper={'POSE + NN': '-', 'POSE': '--', 'NN': ':'},
                           node_labels=node_labels, edge_labels=None,
                           node_cbar_label=ttl,
                           node_cmap_drawedges=ncde,
                           node_ticklabels_mapper=ntm,
                          )        

../../_images/7509c6b0d8ba027b46e59cfb05c11fa1a330a649578a994708e2e3aa8136b2d9.png

node_colors_record = {}
node_colors_record[f'branch'] = [G_poser_nn.nodes[k]['branch'] for k in nodelist]
for gn in gene_sig:
    node_colors_record[gn] = keeper.data['rna'].subset(features=[gn]).loc[gn].loc[[node_labels[kk] for kk in nodelist]].values.tolist()

fig, axes = plt.subplots(7,6, figsize=(21,21), # gridspec_kw={'hspace':0.001, 'wspace': 0.02},
                         layout='constrained')
fig.dpi = 300
fontsize = 11 # 42
fig.suptitle(f"POSE: {key}\nnodes colored by:", fontsize=fontsize);
node_size=5
for (ttl, nc), cmap, node_shape, nsm, ncde, ntm, ax in zip(node_colors_record.items(), 
                                                           ['tab10' if branch_record.nunique() <= 10 else 'tab20' if branch_record.nunique() <=20 else 'gist_rainbow'] + ['nipy_spectral']*(len(node_colors_record)-1),
                                                           [node_shape_pam50]*len(node_colors_record), # 2 + [node_shape_os] + [node_shape_pam50]*6,
                                                           [nsm_pam50]*len(node_colors_record), # *2 + [nsm_os] + [nsm_pam50]*6,
                                                           [True] + [None]*len(node_colors_record), # *8,
                                                           [None]*len(node_colors_record), # *8, #  + [{0: 'NIV3-NAIVE', 1: 'NIV3-PROG'}] + [None]*3,
                                                           np.ravel(axes)):
    ax = nfv.plot_topology(G_poser_nn, pos=pos, ax=ax,
                           with_node_labels=False, node_font_size=2, with_edge_labels=False,
                           node_cbar=True,
                           nodelist=nodelist, edgelist=edgelist, node_cmap=cmap,
                           node_size=node_size, node_color=nc, edge_color='lightgray',                   
                           node_shape=node_shape, node_shape_mapper=nsm, 
                           border_linewidths=0., # 0.5, 
                           bordercolors='k', edge_width=1.0, edge_style=edge_style, 
                           edge_style_mapper={'POSE + NN': '-', 'POSE': '--', 'NN': ':'},
                           node_labels=node_labels, edge_labels=None,
                           node_cbar_label=ttl,
                           node_cmap_drawedges=ncde,
                           node_ticklabels_mapper=ntm,
                           edge_show_legend=False, node_show_legend=False,
                          )

../../_images/e5cb37e79f4330e23c3ff5d55f4728248fb0374e5d29bb18c18a5e9a5307c3f6.png

node_colors_record = {}
node_colors_record[f'branch'] = [G_poser_nn.nodes[k]['branch'] for k in nodelist]
for gn in ['CD28'] + list(keeper.graphs['rna'].neighbors('CD28')): # gene_sig:
    node_colors_record[gn] = keeper.data['rna'].subset(features=[gn]).loc[gn].loc[[node_labels[kk] for kk in nodelist]].values.tolist()

fig, axes = plt.subplots(3, 7, figsize=(16,9), # gridspec_kw={'hspace':0.001, 'wspace': 0.02},
                         layout='constrained')
fig.dpi = 300
fontsize = 11 # 42
fig.suptitle(f"POSE: {key}\nnodes colored by:", fontsize=fontsize);
node_size=5
for (ttl, nc), cmap, node_shape, nsm, ncde, ntm, ax in zip(node_colors_record.items(), 
                                                           ['tab10' if branch_record.nunique() <= 10 else 'tab20' if branch_record.nunique() <=20 else 'gist_rainbow'] + ['nipy_spectral']*(len(node_colors_record)-1),
                                                           [node_shape_pam50]*len(node_colors_record), # 2 + [node_shape_os] + [node_shape_pam50]*6,
                                                           [nsm_pam50]*len(node_colors_record), # *2 + [nsm_os] + [nsm_pam50]*6,
                                                           [True] + [None]*len(node_colors_record), # *8,
                                                           [None]*len(node_colors_record), # *8, #  + [{0: 'NIV3-NAIVE', 1: 'NIV3-PROG'}] + [None]*3,
                                                           np.ravel(axes)):
    ax = nfv.plot_topology(G_poser_nn, pos=pos, ax=ax,
                           with_node_labels=False, node_font_size=2, with_edge_labels=False,
                           node_cbar=True,
                           nodelist=nodelist, edgelist=edgelist, node_cmap=cmap,
                           node_size=node_size, node_color=nc, edge_color='lightgray',                   
                           node_shape=node_shape, node_shape_mapper=nsm, 
                           border_linewidths=0., # 0.5, 
                           bordercolors='k', edge_width=1.0, edge_style=edge_style, 
                           edge_style_mapper={'POSE + NN': '-', 'POSE': '--', 'NN': ':'},
                           node_labels=node_labels, edge_labels=None,
                           node_cbar_label=ttl,
                           node_cmap_drawedges=ncde,
                           node_ticklabels_mapper=ntm,
                           edge_show_legend=False, node_show_legend=False,
                          )

../../_images/7e553f5e03887959ca13cf73fa16149d8673312fc483233c2aea4b032d344a56.png