Keeper tutorial#

The primary data structure used in netflow is called a Keeper. It is used to load, store, manipulate and save data for a set of observations. In particular, there are several specific types of keepers:

  • DataKeeper : handles feature data

  • DistanceKeeper : handles pairwise-observation distances (also used to handle pairwise observation similarities)

  • GraphKeeper : handles graphs (networks)

Interacting with netflow primarily entails using the main Keeper class, which implicitly makes use of the specific keeper classes listed above. This tutorial therefore focuses on the Keeper class; please see the documentation for more detail on the other keeper classes.

Data is organized in the Keeper class via the following attributes:

  • self.outdir : (directory path) : Path to directory where results will be saved.

    • If not provided, no results can be saved.

  • self.observation_labels : (list) Observation labels are kept consistent across all feature data, distances and similarities.

  • self.data : (DataKeeper) Used to handle all feature data.

  • self.distances : (DistanceKeeper) Used to handle all observation-pairwise distances.

  • self.similarities : (DistanceKeeper) Used to handle all observation-pairwise similarities.

  • self.graphs : (GraphKeeper) Used to handle all graphs.

  • self.misc : (dict) Used to handle any miscellaneous data.

    • Caution should be taken as observation labels and/or ordering of data stored in self.misc may not be consistent with the observations as tracked by the Keeper.
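
To make this layout concrete, the following toy sketch mirrors the attributes listed above using plain Python containers. ToyKeeper and its can_save method are hypothetical names written only for illustration; this is not netflow's implementation, and the real Keeper uses the specialized keeper classes rather than plain dicts.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List, Optional


@dataclass
class ToyKeeper:
    """Hypothetical stand-in that mirrors the Keeper attribute layout."""
    outdir: Optional[Path] = None                               # where results would be saved
    observation_labels: List[str] = field(default_factory=list)
    data: Dict[str, Any] = field(default_factory=dict)          # feature data sets
    distances: Dict[str, Any] = field(default_factory=dict)     # pairwise distances
    similarities: Dict[str, Any] = field(default_factory=dict)  # pairwise similarities
    graphs: Dict[str, Any] = field(default_factory=dict)        # networks
    misc: Dict[str, Any] = field(default_factory=dict)          # anything else

    def can_save(self) -> bool:
        # Without an output directory, nothing can be written to disk.
        return self.outdir is not None


toy = ToyKeeper(observation_labels=["obs_a", "obs_b"])
toy.misc["genes_of_interest"] = ["ABCB8", "SAT1"]
print(toy.can_save())
```

Each container is keyed by a reference label, which is the same access pattern the tutorial uses later (for example keeper.data['rna']).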

We will now walk through some use-cases of how to make use of the Keeper class.

First, import the necessary packages:

Load libraries#

import pathlib
import sys

from collections import defaultdict as ddict
import itertools
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd
import scipy.sparse as sc_sparse
from tqdm import tqdm

If netflow has not been installed, add the path to the library:

sys.path.insert(0, pathlib.Path(pathlib.Path('.').absolute()).parents[3].resolve().as_posix())
# sys.path.insert(0, pathlib.Path(pathlib.Path('.').absolute()).parents[0].resolve().as_posix())

From the netflow package, we load the following modules:

  • The InfoNet class is used to compute 1-hop neighborhood distances

  • The Keeper class is used to store and manipulate data/results

import netflow as nf

# from netflow.keepers import keeper 

Set up directories#

MAIN_DIR = pathlib.Path('.').absolute()

Paths to where data is stored:

DATA_DIR = MAIN_DIR / 'example_data' / 'breast_tcga'

RNA_FNAME = DATA_DIR / 'rna_606.txt'
E_RNA_FNAME = DATA_DIR / 'edgelist_hprd_rna_606.txt'

CNA_FNAME = DATA_DIR / 'cna_606.txt'
E_CNA_FNAME = DATA_DIR / 'edgelist_hprd_cna_606.txt'

METH_FNAME = DATA_DIR / 'methylation_606.txt'
E_METH_FNAME = DATA_DIR / 'edgelist_hprd_methylation_606.txt'

CLIN_FNAME = DATA_DIR / 'clin_606.txt'

Directory where output should be saved:

OUT_DIR = MAIN_DIR / 'example_data' / 'results_netflow_breast_tcga'
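
Before loading anything, it can help to confirm that the expected input files are where these paths point. The helper below is a minimal sketch using only pathlib; missing_files is a hypothetical name and not part of netflow, and the file names are copied from the paths defined above.

```python
from pathlib import Path
from typing import Iterable, List


def missing_files(paths: Iterable[Path]) -> List[Path]:
    """Return the paths that do not point to an existing file."""
    return [p for p in paths if not Path(p).is_file()]


# Example with the tutorial's layout; adjust data_dir to your own checkout.
data_dir = Path(".").absolute() / "example_data" / "breast_tcga"
expected = [data_dir / "rna_606.txt", data_dir / "cna_606.txt", data_dir / "clin_606.txt"]
for path in missing_files(expected):
    print(f"missing: {path}")
```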

Load data#

We first load example TCGA breast cancer data that will be used to demonstrate how data can be loaded into the Keeper.

Sample inclusion criteria (n=606):

  • Restricted to

    • Primary samples

    • Cancer type: Breast Cancer

    • Detailed cancer type: Breast Invasive Ductal Carcinoma (IDC) or Breast Invasive Lobular Carcinoma (ILC)

  • Has reported overall survival status

  • Overall survival > 3.5 months (Figure 1)

    [Figure 1: OS histogram]

    • The age distribution of the excluded patients is similar to the age distribution of all samples (Figure 2)

    [Figure 2: age histogram]

  • Has reported PAM50 subtype

  • Has RNA, CNA and methylation data

  • Feature data:

    • RNA (7,148 genes)

      • Remove genes that have zero expression in at least 20% (=121) of samples

      • When multiple Entrez IDs map to the same gene, use the sum

      • Restrict to largest connected component of genes in HPRD

    • CNA (8,763 genes)

      • When multiple Entrez IDs map to the same gene, select the more extreme value (i.e., in absolute value). If the absolute values are the same, select the loss.

      • Restrict to largest connected component of genes in HPRD

      • Translate from [-2, 2] -> [0, 4]

    • methylation (7,969 genes)

      • Missing rate is at most 5.12%. Missing values were filled by nearest-neighbor imputation in Python using the scikit-learn KNNImputer with default parameters except weights="distance".

      • When multiple Entrez IDs map to the same gene, select the max.

      • Restrict to largest connected component of genes in HPRD

  • Network edgelists (derived from HPRD):

    • RNA edgelist (7,148 nodes and 27,498 edges)

    • CNA edgelist (8,763 nodes and 34,906 edges)

    • methylation edgelist (7,969 nodes and 31,112 edges)
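
Two of the preprocessing steps listed above can be sketched on toy data with pandas. This only illustrates the described rules and is not the script used to build the example files. The variable names are made up, and reading "at least 20% (=121)" as a count threshold of int(0.2 * n_samples) is an assumption.

```python
import pandas as pd

# Toy RNA matrix: features (genes) as rows, observations (samples) as columns.
rna_toy = pd.DataFrame(
    [[5.0, 0.0, 3.0, 2.0, 1.0],
     [0.0, 0.0, 4.0, 1.0, 2.0],
     [1.0, 2.0, 3.0, 4.0, 5.0]],
    index=["gene_a", "gene_b", "gene_c"],
    columns=["s1", "s2", "s3", "s4", "s5"],
)

# Drop genes with zero expression in at least 20% of samples.
# Assumption: the threshold is a sample count, int(0.2 * n_samples).
n_samples = rna_toy.shape[1]
min_zero = int(0.2 * n_samples)
n_zero = (rna_toy == 0).sum(axis=1)
rna_filtered = rna_toy.loc[n_zero < min_zero]

# Shift toy CNA calls from [-2, 2] to [0, 4].
cna_toy = pd.DataFrame([[-2, 0, 2], [1, -1, 0]],
                       index=["gene_a", "gene_b"], columns=["s1", "s2", "s3"])
cna_shifted = cna_toy + 2

print(rna_filtered.index.tolist())
print(cna_shifted.values.tolist())
```

With 606 samples the same expression gives a threshold of 121, which matches the count quoted in the list.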

rna = pd.read_csv(RNA_FNAME, header=0, index_col=0)
print(rna.shape)

display(rna.head(2))
(7148, 606)
TCGA-E9-A295-01 TCGA-AR-A1AS-01 TCGA-AQ-A1H2-01 TCGA-A8-A08O-01 TCGA-BH-A1FJ-01 TCGA-JL-A3YX-01 TCGA-A7-A425-01 TCGA-AC-A2BM-01 TCGA-LL-A6FP-01 TCGA-A7-A26E-01 ... TCGA-A2-A1G0-01 TCGA-WT-AB41-01 TCGA-EW-A1P6-01 TCGA-XX-A89A-01 TCGA-A7-A4SD-01 TCGA-AC-A6IX-01 TCGA-AR-A24L-01 TCGA-BH-A42U-01 TCGA-AR-A24S-01 TCGA-BH-A0BC-01
ABCB8 910.2982 624.0432 422.2278 659.3773 836.3215 1202.2298 1264.0832 759.0133 2768.8172 560.3929 ... 840.2367 1074.9211 769.4139 922.0595 597.0149 863.0025 451.0316 795.7087 606.1712 950.1840
SAT1 2727.0313 2731.2022 3701.4551 2259.7105 3085.1455 3915.4537 3230.3552 2042.0907 3258.4005 4837.8020 ... 1728.7968 6252.3659 2843.2332 7392.0642 3602.4876 3878.9536 2552.8859 5193.3684 5891.7026 3594.8176

2 rows × 606 columns

E_rna = pd.read_csv(E_RNA_FNAME, header=0)
display(E_rna.head(2))

G_rna = nx.from_pandas_edgelist(E_rna)
G_rna.name = 'rna'
print(G_rna)
source target
0 ABCB8 SAT1
1 SAT1 APLP1
Graph named 'rna' with 7148 nodes and 27498 edges

cna = pd.read_csv(CNA_FNAME, header=0, index_col=0)
print(cna.shape)

display(cna.head(2))
(8763, 606)
TCGA-E9-A295-01 TCGA-AR-A1AS-01 TCGA-AQ-A1H2-01 TCGA-A8-A08O-01 TCGA-BH-A1FJ-01 TCGA-JL-A3YX-01 TCGA-A7-A425-01 TCGA-AC-A2BM-01 TCGA-LL-A6FP-01 TCGA-A7-A26E-01 ... TCGA-A2-A1G0-01 TCGA-WT-AB41-01 TCGA-EW-A1P6-01 TCGA-XX-A89A-01 TCGA-A7-A4SD-01 TCGA-AC-A6IX-01 TCGA-AR-A24L-01 TCGA-BH-A42U-01 TCGA-AR-A24S-01 TCGA-BH-A0BC-01
ABCB8 2 2 1 2 2 2 3 2 3 2 ... 3 2 2 2 1 2 2 2 2 2
SAT1 2 2 1 1 2 1 2 1 2 2 ... 3 2 2 1 1 2 2 2 2 2

2 rows × 606 columns

E_cna = pd.read_csv(E_CNA_FNAME, header=0)
display(E_cna.head(2))

G_cna = nx.from_pandas_edgelist(E_cna)
G_cna.name = 'cna'
print(G_cna)
source target
0 ABCB8 SAT1
1 SAT1 APLP1
Graph named 'cna' with 8763 nodes and 34906 edges
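
In the outputs above, each graph has as many nodes as its feature table has rows (7,148 for RNA and 8,763 for CNA), which suggests that the node set of a graph coincides with the feature labels of the matching data set. A quick way to check this on your own data is sketched below on toy inputs. The variable names are illustrative, and whether netflow itself enforces such a match is not stated here.

```python
import pandas as pd

# Toy feature labels and a toy edgelist with 'source' and 'target' columns.
features = pd.Index(["ABCB8", "SAT1", "APLP1", "ABCF3"])
edges = pd.DataFrame({"source": ["ABCB8", "SAT1"], "target": ["SAT1", "APLP1"]})

# All nodes that appear at either end of an edge.
nodes = set(edges["source"]) | set(edges["target"])

not_in_data = sorted(nodes - set(features))    # graph nodes without feature data
not_in_graph = sorted(set(features) - nodes)   # features that are isolated or absent

print(not_in_data)
print(not_in_graph)
```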

Initialize the Keeper#

The Keeper can be instantiated with or without outdir, an output directory:

# uncomment to initialize Keeper with no output directory:
# keeper = nf.Keeper() 

# initializing Keeper with output directory:
keeper = nf.Keeper(outdir=OUT_DIR)

See the documentation for more details on initializing the Keeper.

Load data into the Keeper#

Currently, data is expected to be either in the form of a numpy.ndarray or pandas.DataFrame, or saved in a file that is loadable via pandas.read_csv, with observations as columns and features as rows.

Data can be loaded into the Keeper in two ways: (i) from a numpy.ndarray or pandas.DataFrame, or (ii) directly from a file that is loadable via pandas.read_csv.

Note: Future releases will handle various file types.

keeper.add_data(rna, 'rna')

The first time data is loaded into keeper, it sets the observation labels. Any feature data, pairwise-observation distances, or pairwise-observation similarities added to keeper from then on must be consistent with these labels.

If instead the feature data was provided as a numpy.ndarray, observation labels default to 'X0', 'X1', ....
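
If you prefer explicit labels when starting from an array, you can wrap it in a pandas.DataFrame before adding it. The sketch below builds observation labels in the 'X0', 'X1', ... style described above; the feature labels are hypothetical and chosen only for this illustration.

```python
import numpy as np
import pandas as pd

# Toy feature matrix: 3 features (rows) by 4 observations (columns).
values = np.arange(12, dtype=float).reshape(3, 4)

# Observation labels in the 'X0', 'X1', ... style.
obs_labels = [f"X{i}" for i in range(values.shape[1])]
# Hypothetical feature labels, used only in this example.
feat_labels = [f"feature_{i}" for i in range(values.shape[0])]

frame = pd.DataFrame(values, index=feat_labels, columns=obs_labels)
print(frame.columns.tolist())
```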

Within the netflow environment, this data set can then be referred to by its reference label or key, e.g., key='rna'.

We see that keeper has updated the observation labels:

# uncomment to print out all observation labels:
# keeper.observation_labels

We can also add the CNA data:

keeper.add_data(cna, 'cna')
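
Because the observation labels must agree across data sets, it can be useful to compare column labels before adding a second table. The helper below is a sketch on toy frames; label_mismatch is a hypothetical name, not a netflow function, and it only compares label sets (it does not check ordering).

```python
import pandas as pd


def label_mismatch(reference: pd.DataFrame, other: pd.DataFrame) -> dict:
    """Report observation (column) labels present in only one of the frames."""
    ref, oth = set(reference.columns), set(other.columns)
    return {"missing": sorted(ref - oth), "unexpected": sorted(oth - ref)}


rna_toy = pd.DataFrame([[1.0, 2.0, 3.0]], index=["g1"], columns=["s1", "s2", "s3"])
cna_toy = pd.DataFrame([[2, 2, 1]], index=["g1"], columns=["s1", "s2", "s4"])
print(label_mismatch(rna_toy, cna_toy))
```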

Similarly, we can add the graph associated with the RNA data:

keeper.add_graph(G_rna, 'rna')

If we had pairwise-observation distances or similarities, we could add them to keeper in the same manner using the methods:

  • keeper.add_distance()

  • keeper.add_similarity()

Suppose we had some miscellaneous data, for example a list of genes of interest, that we wanted to store in the Keeper. This could be done as follows:

gene_list = ['SLC3A1', 'TIMM13', 'STYX', 'MUC6', 'TIMM17B', 'TNFSF18', 'TOMM7', 'TOMM40',
             'TRAPPC2L', 'AZU1', 'TUBB1', 'XCR1', 'VPS33B', 'XCL2', 'KNTC1']
keeper.add_misc(gene_list, 'my_gene_list')

Load data into keeper at time of instantiation#

Alternatively, you can specify any data, distances, similarities, graphs, or miscellaneous data to be loaded into the Keeper at the time it is initialized. For example, we initialize the Keeper and load the RNA and CNA feature data and their associated graphs in one call:

keeper = nf.Keeper(data={'rna': rna, 'cna': cna}, graphs={'rna': G_rna, 'cna': G_cna})

See the documentation for additional options for initializing and loading data into a Keeper.

Load data into Keeper from file#

Alternatively, we can load data, distances, similarities, and graphs directly from a file.

Currently, file formats that can be loaded by pandas.read_csv are accepted. Future releases will offer additional file types.

We start by initializing a Keeper and then load the RNA and CNA data from file. The argument label is used to specify how the data is referenced in the Keeper.

keeper = nf.Keeper(outdir=OUT_DIR)
# Add RNA and CNA feature data:
keeper.load_data(RNA_FNAME, label='rna', header=0, index_col=0) 
keeper.load_data(CNA_FNAME, label='cna', header=0, index_col=0, dtype=float)

# uncomment to add methylation feature data:
# keeper.load_data(METH_FNAME, label='meth', header=0, index_col=0)

We next add the graphs associated with RNA and CNA. (See the documentation for more details and the expected edgelist format.)

keeper.load_graph(E_RNA_FNAME, label='rna')
keeper.load_graph(E_CNA_FNAME, label='cna')

# uncomment to add methylation graph:
# keeper.load_graph(E_METH_FNAME, label='meth')

Similarities and distances can be loaded from file into the Keeper in the same manner via:

  • keeper.load_distance() or keeper.load_stacked_distance()

  • keeper.load_similarity() or keeper.load_stacked_similarity()

See the documentation for more details.
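
The exact layout expected by the stacked loaders is described in the netflow documentation and is not reproduced here. As a rough illustration only, and assuming "stacked" refers to a long format with one row per observation pair, the pandas sketch below converts a square distance matrix to such a form and back.

```python
import numpy as np
import pandas as pd

labels = ["s1", "s2", "s3"]
square = pd.DataFrame([[0.0, 1.0, 2.0],
                       [1.0, 0.0, 3.0],
                       [2.0, 3.0, 0.0]], index=labels, columns=labels)

# Keep each unordered pair once (strict upper triangle), one row per pair.
upper = np.triu(np.ones(square.shape, dtype=bool), k=1)
stacked = square.where(upper).stack().dropna()
print(stacked)

# Rebuild the symmetric square matrix from the long form.
rebuilt = stacked.unstack().reindex(index=labels, columns=labels).fillna(0.0)
rebuilt = rebuilt + rebuilt.T
```

Storing each pair once roughly halves the file size for symmetric distances, at the cost of a reshaping step when loading.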

Extract data from the Keeper#

We can extract the data in the form of a pandas.DataFrame as follows:

df = keeper.data['rna'].to_frame()
df.head(3)
TCGA-E9-A295-01 TCGA-AR-A1AS-01 TCGA-AQ-A1H2-01 TCGA-A8-A08O-01 TCGA-BH-A1FJ-01 TCGA-JL-A3YX-01 TCGA-A7-A425-01 TCGA-AC-A2BM-01 TCGA-LL-A6FP-01 TCGA-A7-A26E-01 ... TCGA-A2-A1G0-01 TCGA-WT-AB41-01 TCGA-EW-A1P6-01 TCGA-XX-A89A-01 TCGA-A7-A4SD-01 TCGA-AC-A6IX-01 TCGA-AR-A24L-01 TCGA-BH-A42U-01 TCGA-AR-A24S-01 TCGA-BH-A0BC-01
ABCB8 910.2982 624.0432 422.2278 659.3773 836.3215 1202.2298 1264.0832 759.0133 2768.8172 560.3929 ... 840.2367 1074.9211 769.4139 922.0595 597.0149 863.0025 451.0316 795.7087 606.1712 950.1840
SAT1 2727.0313 2731.2022 3701.4551 2259.7105 3085.1455 3915.4537 3230.3552 2042.0907 3258.4005 4837.8020 ... 1728.7968 6252.3659 2843.2332 7392.0642 3602.4876 3878.9536 2552.8859 5193.3684 5891.7026 3594.8176
ABCF3 1310.1533 1381.3597 1413.9488 1078.8295 992.8420 1273.1496 1344.0976 1256.8570 2183.8038 1067.4277 ... 1055.8843 2185.3312 1145.1250 1031.6486 1705.9701 1292.1338 924.5234 1223.4741 1003.6413 1278.0836

3 rows × 606 columns

Similarly, this can be done for any key in the distance or similarity keeper in the form:

  • keeper.distances[key].to_frame()

  • keeper.similarities[key].to_frame()

Next, we demonstrate how to get a graph that has been stored in keeper, keyed by its reference label:

G = keeper.graphs['rna']
print(G)
Graph named 'rna' with 7148 nodes and 27498 edges

Similarly, we can access miscellaneous data stored in keeper:

# first we add the gene list to the keeper:
keeper.add_misc(gene_list, 'my_gene_list')
# and to access the gene list:
keeper.misc['my_gene_list']
['SLC3A1',
 'TIMM13',
 'STYX',
 'MUC6',
 'TIMM17B',
 'TNFSF18',
 'TOMM7',
 'TOMM40',
 'TRAPPC2L',
 'AZU1',
 'TUBB1',
 'XCR1',
 'VPS33B',
 'XCL2',
 'KNTC1']

Save data from the Keeper#

Data can be saved from:

  • keeper.data via keeper.save_data()

  • keeper.distances via keeper.save_distance()

  • keeper.similarities via keeper.save_similarity()

  • keeper.misc (if it is a pandas.DataFrame) via keeper.save_misc()

Currently, data is saved using pandas.DataFrame.to_csv. Future releases will provide additional formats for saving data.

In order to save data, keeper must have the attribute outdir defined.

The data is then saved to a file named {outdir}/{data_type}_{label}.{file_format} where

  • outdir : The keeper’s output directory: keeper.outdir.

  • data_type : This is one of {‘data’, ‘distance’, ‘similarity’, ‘misc’}, depending on which method is called to save the data.

  • label : The reference label, specified by the user, of the data in the keeper that should be saved.

  • file_format : The file extension, provided by the user (default = ‘txt’).

See the documentation for more details.

For example, to save the RNA data to the specified output directory, you would call:

keeper.save_data('rna')
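
Following the naming pattern described above, this call would be expected to write a file named data_rna.txt inside the output directory. The helper below is a sketch that builds such a path from the documented pattern; output_path is a hypothetical name and not part of netflow.

```python
from pathlib import Path


def output_path(outdir, data_type, label, file_format="txt"):
    """Build {outdir}/{data_type}_{label}.{file_format} as a Path."""
    allowed = {"data", "distance", "similarity", "misc"}
    if data_type not in allowed:
        raise ValueError(f"data_type must be one of {sorted(allowed)}")
    return Path(outdir) / f"{data_type}_{label}.{file_format}"


print(output_path("results_netflow_breast_tcga", "data", "rna").name)
```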

Extract Keeper subset#

You can create a new Keeper instance with a subset of the observations from the keeper.

Caution should be taken with the misc and graphs keepers of the resulting subset, as they are maintained independently of the observations. You can choose whether the misc and graphs should be copied, as is, into the new subset Keeper.

You can also specify the output directory outdir for the new keeper subset. Default is None.

For example, we can extract a subset of keeper with the first 10 observations, using keep_misc=True and keep_graphs=True to keep a copy of the miscellaneous data and the graphs. Additionally, we use outdir to specify the output directory for this subset of observations:

outdir_sub = OUT_DIR.parent / (OUT_DIR.name + '_sub')

keeper_sub = keeper.subset(keeper.observation_labels[:10], 
                           keep_misc=True, keep_graphs=True,
                           outdir=outdir_sub)

Extract data subset#

You can extract a subset of observations and features for any data stored in the data-keeper as a pandas.DataFrame:

rna_sub = keeper.data['rna'].subset(observations=keeper.observation_labels[:3],
                                    features=gene_list[:5])
rna_sub
TCGA-E9-A295-01 TCGA-AR-A1AS-01 TCGA-AQ-A1H2-01
SLC3A1 4.8292 0.0000 3.1360
TIMM13 1078.3533 992.7960 708.2288
STYX 829.6511 542.0982 403.1611
MUC6 1.4488 130.1216 0.0000
TIMM17B 827.7194 573.1652 739.5886

Similarly, you can extract distances (or similarities) between a subset of observations as a pandas.DataFrame. We’ll first add a distance to the keeper to demonstrate this:

from scipy.spatial.distance import cdist
rna_euc = pd.DataFrame(data=cdist(rna.T, rna.T, metric='euclidean'),
                       index=rna.columns.copy(), columns=rna.columns.copy())
keeper.add_distance(rna_euc, 'rna_euc')
keeper.distances['rna_euc'].subset(keeper.observation_labels[:3])
TCGA-E9-A295-01 TCGA-AR-A1AS-01 TCGA-AQ-A1H2-01
TCGA-E9-A295-01 0.000000 351149.925916 392258.653370
TCGA-AR-A1AS-01 351149.925916 0.000000 382107.888105
TCGA-AQ-A1H2-01 392258.653370 382107.888105 0.000000

The same can be done for a similarity stored in keeper using keeper.similarities in place of keeper.distances.
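
Conceptually, subsetting a pairwise matrix means restricting both its rows and its columns to the same observations. The sketch below shows this with plain pandas and NumPy on toy data. It is independent of netflow and computes the Euclidean distances by broadcasting instead of scipy's cdist.

```python
import numpy as np
import pandas as pd

# Toy feature matrix: features as rows, observations as columns.
toy = pd.DataFrame([[0.0, 3.0, 0.0, 1.0],
                    [0.0, 4.0, 8.0, 1.0]],
                   index=["f1", "f2"], columns=["s1", "s2", "s3", "s4"])

# Pairwise Euclidean distances between observations (columns).
points = toy.T.to_numpy()
diff = points[:, None, :] - points[None, :, :]
dist = pd.DataFrame(np.sqrt((diff ** 2).sum(axis=-1)),
                    index=toy.columns, columns=toy.columns)

# Restrict rows and columns to the same subset of observations.
keep = ["s1", "s2", "s3"]
dist_sub = dist.loc[keep, keep]
print(dist_sub)
```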

Iterating through feature data, distances or similarities#

Before demonstrating how to iterate through the data, we review some of the Keeper class properties in a bit more detail:

When we instantiate a Keeper object, keeper = netflow.Keeper(), the keeper is initialized with an instance of the DataKeeper class assigned as keeper.data and two instances of the DistanceKeeper class assigned as keeper.distances and keeper.similarities. Data is added to the DataKeeper and DistanceKeeper with a reference key, and may then be accessed in the same manner as retrieving a value from a dict. However, the DataKeeper and DistanceKeeper maintain a bit more information than just the data itself in order to regulate observation (and feature) properties. Therefore, keyed access to data in a DataKeeper or DistanceKeeper returns an instance of DataView or DistanceView, respectively (instead of the original input data).

For example: x = keeper.data['rna']
Here, x is an instance of the DataView class.
The RNA data itself (as a numpy.ndarray) can be accessed via x.data.
Previously, we demonstrated how to extract the data as a pandas.DataFrame: x.to_frame(). (Here, to_frame() is a method of both the DataView and DistanceView classes.)
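
The relationship between the raw array and its labeled frame can be pictured with a small stand-in class. ToyDataView below is hypothetical and much simpler than netflow's DataView. It borrows the attribute names used in this tutorial (label, data, num_features, feature_labels, to_frame), while the observation_labels attribute on the view is an assumption of this sketch.

```python
import numpy as np
import pandas as pd


class ToyDataView:
    """Hypothetical view pairing a raw array with its labels."""

    def __init__(self, label, data, feature_labels, observation_labels):
        self.label = label
        self.data = np.asarray(data)            # features x observations
        self.feature_labels = list(feature_labels)
        self.observation_labels = list(observation_labels)

    @property
    def num_features(self):
        return self.data.shape[0]

    def to_frame(self):
        # Reattach the labels to the raw array.
        return pd.DataFrame(self.data, index=self.feature_labels,
                            columns=self.observation_labels)


view = ToyDataView("rna", [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
                   ["g1", "g2", "g3"], ["s1", "s2"])
print(view.num_features)
print(view.to_frame())
```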

You can iterate over the data stored in a DataKeeper or DistanceKeeper, which yields an instance of DataView or DistanceView, respectively, at each iteration.

Therefore, you can iterate over the feature data, distances and similarities. The process is the same for each type of stored data, so we demonstrate this on the feature data.

Here we iterate through all the feature data and print the keyed-label:

for dd in keeper.data:
    print(dd.label)
rna
cna

Here we iterate through all the feature data and print out the number of features:

for dd in keeper.data:
    print(dd.num_features)
7148
8763

Here we iterate through all the feature data and print out the first 4 feature labels:

for dd in keeper.data:
    print(dd.feature_labels[:4])
['ABCB8', 'SAT1', 'ABCF3', 'ARF1']
['ABCB8', 'SAT1', 'ABCF3', 'ARF1']

Here we iterate through all the feature data and print out the stored data for the first 3 observations and first 4 features:

for dd in keeper.data:
    print(dd.data[:4, :3])
[[  910.2982   624.0432   422.2278]
 [ 2727.0313  2731.2022  3701.4551]
 [ 1310.1533  1381.3597  1413.9488]
 [15602.5595 17668.1675 10149.0216]]
[[2. 2. 1.]
 [2. 2. 1.]
 [2. 3. 2.]
 [3. 3. 2.]]

You can also use the keys() method to see the labels of the data stored in any of the keepers:

keeper.data.keys()
dict_keys(['rna', 'cna'])
keeper.distances.keys()
dict_keys(['rna_euc'])
keeper.graphs.keys()
dict_keys(['rna', 'cna'])

Membership#

You can check if data is in keeper.data, keeper.distances, or keeper.similarities by its key.

For example, we next check if the RNA feature data is in keeper.data:

'rna' in keeper.data
True
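
The behaviors shown in the last two sections (iteration that yields the stored items, keys() for the labels, and membership by key) can be pictured with a small container. ToyStore below is a hypothetical sketch in plain Python, not netflow's implementation. Unlike a dict, iterating over it yields the stored values instead of the keys, which mirrors how iterating over keeper.data yields views.

```python
class ToyStore:
    """Hypothetical keyed store that iterates over its values, not its keys."""

    def __init__(self):
        self._items = {}

    def add(self, value, label):
        self._items[label] = value

    def __getitem__(self, label):
        return self._items[label]

    def __contains__(self, label):
        # Membership is checked against the reference labels.
        return label in self._items

    def __iter__(self):
        # Yield the stored items themselves, in insertion order.
        return iter(self._items.values())

    def keys(self):
        return self._items.keys()


store = ToyStore()
store.add({"label": "rna", "num_features": 7148}, "rna")
store.add({"label": "cna", "num_features": 8763}, "cna")

print("rna" in store)
print("meth" in store)
print([item["label"] for item in store])
```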