{
"cells": [
{
"cell_type": "markdown",
"id": "29989051-ea7c-4d60-a442-fb06c4424235",
"metadata": {},
"source": [
"# Keeper tutorial"
]
},
{
"cell_type": "markdown",
"id": "539af752-a16b-43a1-8215-c39c17e216bd",
"metadata": {},
"source": [
"The primary data structure used in `netflow` is called a `Keeper`. It is used to load, store, manipulate and save data for a set of observations. In particular, there are several specific types of keepers:\n",
"\n",
"- `DataKeeper` : handles feature data\n",
"- `DistanceKeeper` : handles pairwise-observation distances (also used to handle pairwise observation similarities)\n",
"- `GraphKeeper` : handles graphs (networks)\n",
"\n",
"Interacting with `netflow` primarily entails using the main `Keeper` class, which implicitly makes use of the specific keeper classes listed above. This tutorial therefore focuses on the `Keeper` class; please see the documentation for more detail on the other keeper classes. \n",
"\n",
"Data is organized in the `Keeper` class via the following attributes:\n",
"\n",
"- `self.outdir` : (directory path) Path to the directory where results will be saved.\n",
" - If not provided, no results can be saved.\n",
"- `self.observation_labels` : (`list`) Observation labels are kept consistent across all feature data, distances and similarities.\n",
"- `self.data` : (`DataKeeper`) Used to handle all feature data.\n",
"- `self.distances` : (`DistanceKeeper`) Used to handle all observation-pairwise distances.\n",
"- `self.similarities` : (`DistanceKeeper`) Used to handle all observation-pairwise similarities.\n",
"- `self.graphs` : (`GraphKeeper`) Used to handle all graphs.\n",
"- `self.misc` : (`dict`) Used to handle any miscellaneous data. \n",
" - Caution should be taken as observation labels and/or ordering of data stored in `self.misc` may not be consistent with the observations as tracked by the `Keeper`.\n",
"\n",
"We will now walk through some use-cases of how to make use of the `Keeper` class."
]
},
{
"cell_type": "markdown",
"id": "28099729-bd14-4b3a-9c08-f029482d5df7",
"metadata": {},
"source": [
"First, import the necessary packages:"
]
},
{
"cell_type": "markdown",
"id": "f994f736-ca88-43d8-bd70-44005fd29592",
"metadata": {},
"source": [
"# Load libraries "
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "379c16e0-418e-46c3-85cf-5ee3b52f58cf",
"metadata": {},
"outputs": [],
"source": [
"import pathlib\n",
"import sys\n",
"\n",
"from collections import defaultdict as ddict\n",
"import itertools\n",
"import matplotlib.pyplot as plt\n",
"import networkx as nx\n",
"import numpy as np\n",
"import pandas as pd\n",
"import scipy.sparse as sc_sparse\n",
"from tqdm import tqdm"
]
},
{
"cell_type": "markdown",
"id": "a8c92cca-8e3a-4f6f-b07a-6bc7a9caf5d4",
"metadata": {},
"source": [
"If ``netflow`` has not been installed, add the path to the library:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "3967f36c-d33a-4fa0-9060-059f01982074",
"metadata": {},
"outputs": [],
"source": [
"sys.path.insert(0, pathlib.Path(pathlib.Path('.').absolute()).parents[3].resolve().as_posix())\n",
"# sys.path.insert(0, pathlib.Path(pathlib.Path('.').absolute()).parents[0].resolve().as_posix())"
]
},
{
"cell_type": "markdown",
"id": "29c246ce-d1dc-4392-8686-62d77a9c9295",
"metadata": {},
"source": [
"We use the following classes from the ``netflow`` package:\n",
" - The ``InfoNet`` class is used to compute 1-hop neighborhood distances\n",
" - The ``Keeper`` class is used to store and manipulate data/results"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "5b9f724f-13fa-4cab-bcc2-33b2698b1caf",
"metadata": {},
"outputs": [],
"source": [
"import netflow as nf\n",
"\n",
"# from netflow.keepers import keeper "
]
},
{
"cell_type": "markdown",
"id": "c377c3dc-d66d-417d-9da5-3e41e5e9882d",
"metadata": {},
"source": [
"# Set up directories"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "672bd1c1-7153-4db3-b7a3-3e686ce76c9a",
"metadata": {},
"outputs": [],
"source": [
"MAIN_DIR = pathlib.Path('.').absolute()"
]
},
{
"cell_type": "markdown",
"id": "ec4a2415-0a6f-481f-bc2c-14cc4b23f210",
"metadata": {},
"source": [
"Paths to where data is stored:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "e86e42a8-fcf3-4732-b93e-343dd1cbb38e",
"metadata": {},
"outputs": [],
"source": [
"DATA_DIR = MAIN_DIR / 'example_data' / 'breast_tcga'\n",
"\n",
"RNA_FNAME = DATA_DIR / 'rna_606.txt'\n",
"E_RNA_FNAME = DATA_DIR / 'edgelist_hprd_rna_606.txt'\n",
"\n",
"CNA_FNAME = DATA_DIR / 'cna_606.txt'\n",
"E_CNA_FNAME = DATA_DIR / 'edgelist_hprd_cna_606.txt'\n",
"\n",
"METH_FNAME = DATA_DIR / 'methylation_606.txt'\n",
"E_METH_FNAME = DATA_DIR / 'edgelist_hprd_methylation_606.txt'\n",
"\n",
"CLIN_FNAME = DATA_DIR / 'clin_606.txt'"
]
},
{
"cell_type": "markdown",
"id": "5aa0e88c-1d2f-410c-9af3-8f311c6fc482",
"metadata": {},
"source": [
"Directory where output should be saved:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "57c62445-d73b-4225-aa08-71aa81f16985",
"metadata": {},
"outputs": [],
"source": [
"OUT_DIR = MAIN_DIR / 'example_data' / 'results_netflow_breast_tcga'"
]
},
{
"cell_type": "markdown",
"id": "3b383f2a-5a4e-466d-804a-d63b234cb552",
"metadata": {},
"source": [
"# Load data"
]
},
{
"cell_type": "markdown",
"id": "4793b893-a186-492e-b985-6b07ec163ef1",
"metadata": {},
"source": [
"We first load example TCGA breast cancer data that will be used to demonstrate how data can be loaded into the `Keeper`.\n",
"\n",
"Sample inclusion criteria (n=606):\n",
"- Restricted to \n",
" - Primary samples\n",
" - Cancer type: Breast Cancer\n",
" - Detailed cancer type: Breast Invasive Ductal Carcinoma (IDC) or Breast Invasive Lobular Carcinoma (ILC)\n",
"- Has reported overall survival status\n",
"- Overall survival > 3.5 months (Figure 1) \n",
" - The age distribution of the excluded patients is similar to the age distribution of all samples (Figure 2)\n",
"- Has reported PAM50 subtype\n",
"- Has RNA, CNA and methylation data\n",
"\n",
"- Feature data:\n",
" - RNA (7,148 genes)\n",
" - Remove genes that have zero expression in at least 20% (=121) of samples\n",
" - When multiple Entrez IDs map to the same gene, use the sum\n",
" - Restrict to largest connected component of genes in HPRD\n",
" - CNA (8,763 genes)\n",
" - When multiple Entrez IDs map to the same gene, select the more extreme value (i.e., in absolute value). If the absolute values are the same, select the loss.\n",
" - Restrict to largest connected component of genes in HPRD\n",
" - Translate from [-2, 2] -> [0, 4]\n",
" - methylation (7,969 genes) \n",
" - Missing rate is at most 5.12%; missing values were filled by nearest-neighbor imputation in Python using the scikit-learn `KNNImputer` with default parameters except `weights='distance'`.\n",
" - When multiple Entrez IDs map to the same gene, select the max.\n",
" - Restrict to largest connected component of genes in HPRD\n",
"- Network edgelists (derived from HPRD):\n",
" - RNA edgelist (7,148 nodes and 27,498 edges)\n",
" - CNA edgelist (8,763 nodes and 34,906 edges)\n",
" - methylation edgelist (7,969 nodes and 31,112 edges)\n"
]
},
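{
"cell_type": "markdown",
"id": "sketch-preprocessing-rules",
"metadata": {},
"source": [
"The gene-level rules listed above can be illustrated with `pandas` on toy values. This is a minimal sketch, not the pipeline that produced the example files: the frames `rna_raw` and `cna_raw` and the helper `most_extreme` are hypothetical, and summing duplicates before filtering is an assumed order of operations.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy RNA table: rows are genes, columns are samples. The repeated index\n",
"# mimics two Entrez IDs that map to the same gene symbol.\n",
"rna_raw = pd.DataFrame(\n",
"    [[1.0, 2.0, 0.0, 4.0, 5.0],\n",
"     [0.5, 0.5, 0.5, 0.5, 0.5],\n",
"     [0.0, 0.0, 3.0, 1.0, 2.0]],\n",
"    index=['G1', 'G1', 'G2'],\n",
"    columns=['s1', 's2', 's3', 's4', 's5'],\n",
")\n",
"\n",
"# Sum rows that map to the same gene, then drop genes with zero\n",
"# expression in at least 20% of samples.\n",
"rna_sum = rna_raw.groupby(level=0).sum()\n",
"zero_frac = (rna_sum == 0).mean(axis=1)\n",
"rna_filtered = rna_sum.loc[zero_frac < 0.2]\n",
"\n",
"# Toy CNA table with values in [-2, 2].\n",
"cna_raw = pd.DataFrame(\n",
"    [[-2, 1, 0], [1, -1, 0]],\n",
"    index=['G1', 'G1'],\n",
"    columns=['s1', 's2', 's3'],\n",
")\n",
"\n",
"def most_extreme(values):\n",
"    # Hypothetical helper: the largest absolute value wins;\n",
"    # on ties the loss (negative value) is kept.\n",
"    return min(values, key=lambda v: (-abs(v), v))\n",
"\n",
"cna_gene = cna_raw.groupby(level=0).agg(most_extreme)\n",
"\n",
"# Translate from [-2, 2] to [0, 4].\n",
"cna_shifted = cna_gene + 2\n",
"```\n",
"\n",
"Here `G2` is dropped from the RNA table because it is zero in 2 of 5 samples, and the CNA values for `G1` become 0, 1 and 2 after the shift."
]
},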
{
"cell_type": "code",
"execution_count": 8,
"id": "5b2b69ec-a949-4b2f-99c9-36fbf01a5ba5",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(7148, 606)\n"
]
},
{
"data": {
"text/plain": [
" TCGA-E9-A295-01 TCGA-AR-A1AS-01 TCGA-AQ-A1H2-01 TCGA-A8-A08O-01 \\\n",
"ABCB8 910.2982 624.0432 422.2278 659.3773 \n",
"SAT1 2727.0313 2731.2022 3701.4551 2259.7105 \n",
"\n",
" TCGA-BH-A1FJ-01 TCGA-JL-A3YX-01 TCGA-A7-A425-01 TCGA-AC-A2BM-01 \\\n",
"ABCB8 836.3215 1202.2298 1264.0832 759.0133 \n",
"SAT1 3085.1455 3915.4537 3230.3552 2042.0907 \n",
"\n",
" TCGA-LL-A6FP-01 TCGA-A7-A26E-01 ... TCGA-A2-A1G0-01 \\\n",
"ABCB8 2768.8172 560.3929 ... 840.2367 \n",
"SAT1 3258.4005 4837.8020 ... 1728.7968 \n",
"\n",
" TCGA-WT-AB41-01 TCGA-EW-A1P6-01 TCGA-XX-A89A-01 TCGA-A7-A4SD-01 \\\n",
"ABCB8 1074.9211 769.4139 922.0595 597.0149 \n",
"SAT1 6252.3659 2843.2332 7392.0642 3602.4876 \n",
"\n",
" TCGA-AC-A6IX-01 TCGA-AR-A24L-01 TCGA-BH-A42U-01 TCGA-AR-A24S-01 \\\n",
"ABCB8 863.0025 451.0316 795.7087 606.1712 \n",
"SAT1 3878.9536 2552.8859 5193.3684 5891.7026 \n",
"\n",
" TCGA-BH-A0BC-01 \n",
"ABCB8 950.1840 \n",
"SAT1 3594.8176 \n",
"\n",
"[2 rows x 606 columns]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"rna = pd.read_csv(RNA_FNAME, header=0, index_col=0)\n",
"print(rna.shape)\n",
"\n",
"display(rna.head(2))"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "84f5b6ec-38dd-42f6-96e3-9d5bdca220b6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" source target\n",
"0 ABCB8 SAT1\n",
"1 SAT1 APLP1"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Graph named 'rna' with 7148 nodes and 27498 edges\n"
]
}
],
"source": [
"E_rna = pd.read_csv(E_RNA_FNAME, header=0)\n",
"display(E_rna.head(2))\n",
"\n",
"G_rna = nx.from_pandas_edgelist(E_rna)\n",
"G_rna.name = 'rna'\n",
"print(G_rna)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "3077cc48-c8a9-4a53-b669-08f0d9497c4a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(8763, 606)\n"
]
},
{
"data": {
"text/plain": [
" TCGA-E9-A295-01 TCGA-AR-A1AS-01 TCGA-AQ-A1H2-01 TCGA-A8-A08O-01 \\\n",
"ABCB8 2 2 1 2 \n",
"SAT1 2 2 1 1 \n",
"\n",
" TCGA-BH-A1FJ-01 TCGA-JL-A3YX-01 TCGA-A7-A425-01 TCGA-AC-A2BM-01 \\\n",
"ABCB8 2 2 3 2 \n",
"SAT1 2 1 2 1 \n",
"\n",
" TCGA-LL-A6FP-01 TCGA-A7-A26E-01 ... TCGA-A2-A1G0-01 \\\n",
"ABCB8 3 2 ... 3 \n",
"SAT1 2 2 ... 3 \n",
"\n",
" TCGA-WT-AB41-01 TCGA-EW-A1P6-01 TCGA-XX-A89A-01 TCGA-A7-A4SD-01 \\\n",
"ABCB8 2 2 2 1 \n",
"SAT1 2 2 1 1 \n",
"\n",
" TCGA-AC-A6IX-01 TCGA-AR-A24L-01 TCGA-BH-A42U-01 TCGA-AR-A24S-01 \\\n",
"ABCB8 2 2 2 2 \n",
"SAT1 2 2 2 2 \n",
"\n",
" TCGA-BH-A0BC-01 \n",
"ABCB8 2 \n",
"SAT1 2 \n",
"\n",
"[2 rows x 606 columns]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"cna = pd.read_csv(CNA_FNAME, header=0, index_col=0)\n",
"print(cna.shape)\n",
"\n",
"display(cna.head(2))"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "8d8a7d58-c052-4e96-a7fb-287c2a2018d3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" source target\n",
"0 ABCB8 SAT1\n",
"1 SAT1 APLP1"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Graph named 'cna' with 8763 nodes and 34906 edges\n"
]
}
],
"source": [
"E_cna = pd.read_csv(E_CNA_FNAME, header=0)\n",
"display(E_cna.head(2))\n",
"\n",
"G_cna = nx.from_pandas_edgelist(E_cna)\n",
"G_cna.name = 'cna'\n",
"print(G_cna)"
]
},
{
"cell_type": "markdown",
"id": "17bbb5a1-77c8-44c1-9681-ce7df1fc21a5",
"metadata": {},
"source": [
"# Initialize the Keeper"
]
},
{
"cell_type": "markdown",
"id": "b6c0d38c-8727-46f6-9fbc-df1978252aaa",
"metadata": {},
"source": [
"The `Keeper` can be instantiated with or without an output directory, `outdir`:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "fad5a3fc-4c89-4494-afc1-364f7e0864d5",
"metadata": {},
"outputs": [],
"source": [
"# uncomment to initialize Keeper with no output directory:\n",
"# keeper = nf.Keeper() \n",
"\n",
"# initializing Keeper with output directory:\n",
"keeper = nf.Keeper(outdir=OUT_DIR)"
]
},
{
"cell_type": "markdown",
"id": "419f6157-29f3-44ef-8523-c44fae08848b",
"metadata": {},
"source": [
"See the documentation for more details on initializing the `Keeper`."
]
},
{
"cell_type": "markdown",
"id": "f281c332-33cd-423a-9b04-1436b3b87a53",
"metadata": {},
"source": [
"# Load data into the Keeper"
]
},
{
"cell_type": "markdown",
"id": "13375e59-4a54-4a9b-9796-ce729dd1be63",
"metadata": {},
"source": [
"Currently, data is expected to be either in the form of a `numpy.ndarray` or `pandas.DataFrame`, or saved in a file that is loadable via `pandas.read_csv`, with observations as columns and features as rows."
]
},
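{
"cell_type": "markdown",
"id": "sketch-data-orientation",
"metadata": {},
"source": [
"Many tools produce tables with observations as rows, so it is worth checking the orientation before adding data. The sketch below uses only `pandas`; the names `samples_by_genes` and `genes_by_samples` and the toy values are illustrative, and the commented-out call mirrors the `add_data` usage in this tutorial.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# A table in the common 'observations as rows' orientation (toy values).\n",
"samples_by_genes = pd.DataFrame(\n",
"    {'ABCB8': [910.3, 624.0], 'SAT1': [2727.0, 2731.2], 'ABCF3': [1310.2, 1381.4]},\n",
"    index=['TCGA-E9-A295-01', 'TCGA-AR-A1AS-01'],\n",
")\n",
"\n",
"# Transpose so that features are rows and observations are columns.\n",
"genes_by_samples = samples_by_genes.T\n",
"print(genes_by_samples.shape)\n",
"\n",
"# keeper.add_data(genes_by_samples, 'rna')\n",
"```\n",
"\n",
"After the transpose, the column labels are the observation labels and the row labels are the features."
]
},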
{
"cell_type": "markdown",
"id": "f3f5b567-63dc-461f-babc-63a8d5c51fc2",
"metadata": {},
"source": [
"Data can be loaded into the Keeper in two ways: \n",
" i. From a `numpy.ndarray` or `pandas.DataFrame` or\n",
" ii. Directly from a file that is loadable via `pandas.read_csv`\n",
" \n",
"Note: Future releases will handle various file types."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "112296ad-452d-4180-bc1c-2eecd3086c6d",
"metadata": {},
"outputs": [],
"source": [
"keeper.add_data(rna, 'rna')"
]
},
{
"cell_type": "markdown",
"id": "2b01a2a5-fc23-4344-898b-aa417bfae547",
"metadata": {},
"source": [
"The first time data is loaded into `keeper`, it sets the observation labels. Any feature data, pairwise-observation distances, or pairwise-observation similarities added to `keeper` from then on must be consistent with these labels. \n",
"\n",
"If instead the feature data was provided as a `numpy.ndarray`, observation labels default to `'X0', 'X1', ...`."
]
},
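{
"cell_type": "markdown",
"id": "sketch-default-labels",
"metadata": {},
"source": [
"To make the default concrete, the following sketch builds labels of the form `'X0', 'X1', ...` for an array with observations as columns. It imitates the behavior described above using `numpy` and `pandas` only and is not `netflow`'s implementation; the names `arr`, `default_labels` and `named` are illustrative.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# 4 features (rows) x 3 observations (columns).\n",
"arr = np.arange(12, dtype=float).reshape(4, 3)\n",
"\n",
"# Labels in the default style: one per column.\n",
"default_labels = [f'X{i}' for i in range(arr.shape[1])]\n",
"print(default_labels)\n",
"\n",
"# Wrapping the array in a DataFrame is one way to supply meaningful labels instead.\n",
"named = pd.DataFrame(arr, columns=['obs_a', 'obs_b', 'obs_c'])\n",
"```\n",
"\n",
"Supplying a `pandas.DataFrame` with named columns avoids relying on the generated labels."
]
},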
{
"cell_type": "markdown",
"id": "b2808dba-f8be-4248-a936-395d286c5cb4",
"metadata": {},
"source": [
"Within the `netflow` environment, this data set can then be referred to by its reference label or __key__, e.g., `key='rna'`."
]
},
{
"cell_type": "markdown",
"id": "e5fd904e-37a6-4ede-b5f2-dd3d4de72d27",
"metadata": {},
"source": [
"We see that `keeper` has updated the observation labels:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "3edddfbd-e14a-4433-a5e1-9c6e3199ff4b",
"metadata": {},
"outputs": [],
"source": [
"# uncomment to print out all observation labels:\n",
"# keeper.observation_labels"
]
},
{
"cell_type": "markdown",
"id": "c246c5a6-80b3-4acb-9bba-8857528f2c6e",
"metadata": {},
"source": [
"We can also add the CNA data:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "8f06d0a6-e69d-48d1-89b1-eb43fc1f6296",
"metadata": {},
"outputs": [],
"source": [
"keeper.add_data(cna, 'cna')"
]
},
{
"cell_type": "markdown",
"id": "e99dc9f8-c3b5-4d08-aab3-a2ad16ba56c6",
"metadata": {},
"source": [
"Similarly, we can add the graph associated with the RNA data:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "6cbee537-cb5f-405d-9896-d83d5407f579",
"metadata": {},
"outputs": [],
"source": [
"keeper.add_graph(G_rna, 'rna')"
]
},
{
"cell_type": "markdown",
"id": "a8898b35-9add-48ae-9081-3694f32f6976",
"metadata": {},
"source": [
"If we had pairwise-observation distances or similarities, we could add them to `keeper` in the same manner using the methods:\n",
"\n",
"- `keeper.add_distance()`\n",
"- `keeper.add_similarity()`"
]
},
{
"cell_type": "markdown",
"id": "521084b4-12c7-4fb3-9369-d19081df0bd8",
"metadata": {},
"source": [
"Suppose we had some miscellaneous data, for example a list of genes of interest, that we wanted to store in the `Keeper`. This could be done as follows:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "79a9a0aa-3823-461a-b384-d0b491de3733",
"metadata": {},
"outputs": [],
"source": [
"gene_list = ['SLC3A1', 'TIMM13', 'STYX', 'MUC6', 'TIMM17B', 'TNFSF18', 'TOMM7', 'TOMM40',\n",
" 'TRAPPC2L', 'AZU1', 'TUBB1', 'XCR1', 'VPS33B', 'XCL2', 'KNTC1']"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "4e9b135f-bcd1-4835-adc2-b7e732f46a72",
"metadata": {},
"outputs": [],
"source": [
"keeper.add_misc(gene_list, 'my_gene_list')"
]
},
{
"cell_type": "markdown",
"id": "b7a7d54c-9733-4e24-9ded-b4e3c3389ba9",
"metadata": {},
"source": [
"# Load data into Keeper at time of instantiation"
]
},
{
"cell_type": "markdown",
"id": "ea40f617-6e25-44e9-ab05-fad88f93121f",
"metadata": {},
"source": [
"Alternatively, you can specify any data, distances, similarities, graphs, or miscellaneous data to be loaded into the `Keeper` at the same time as it is initialized. For example, we initialize the `Keeper` and load the RNA and CNA feature data and their associated graphs in one call:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "4561208d-fdc1-4432-97c4-fd6edd593c23",
"metadata": {},
"outputs": [],
"source": [
"keeper = nf.Keeper(data={'rna': rna, 'cna': cna}, graphs={'rna': G_rna, 'cna': G_cna})"
]
},
{
"cell_type": "markdown",
"id": "966f5765-08e6-4615-8303-4ae9cff3b1e5",
"metadata": {},
"source": [
"See the documentation for additional options for initializing and loading data into a `Keeper`."
]
},
{
"cell_type": "markdown",
"id": "5a80cfc7-a6ce-4f2e-9cf9-70b92b2fee69",
"metadata": {},
"source": [
"# Load data into Keeper from file"
]
},
{
"cell_type": "markdown",
"id": "ac5f3e64-a218-47d0-8436-f31bb14dcd03",
"metadata": {},
"source": [
"Alternatively, we can load data, distances, similarities, and graphs directly from a file. \n",
"\n",
"Currently, file formats that can be loaded by `pandas.read_csv` are accepted. Future releases will offer additional file types."
]
},
{
"cell_type": "markdown",
"id": "1ce2f23e-1ea2-432d-8034-125f53c0bb5e",
"metadata": {},
"source": [
"We start by initializing a `Keeper` and then load the RNA and CNA data from file. The argument `label` is used to specify how the data is referenced in the `Keeper`."
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "7ffd2461-5709-49dc-b0f8-0eab0d16d81e",
"metadata": {},
"outputs": [],
"source": [
"keeper = nf.Keeper(outdir=OUT_DIR)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "8fa87b90-b639-41c9-92b7-3b83acc4f59e",
"metadata": {},
"outputs": [],
"source": [
"# Add RNA and CNA feature data:\n",
"keeper.load_data(RNA_FNAME, label='rna', header=0, index_col=0) \n",
"keeper.load_data(CNA_FNAME, label='cna', header=0, index_col=0, dtype=float)\n",
"\n",
"# uncomment to add methylation feature data:\n",
"# keeper.load_data(METH_FNAME, label='meth', header=0, index_col=0)"
]
},
{
"cell_type": "markdown",
"id": "263aa444-a4b3-4b50-a932-8e3afb562d09",
"metadata": {},
"source": [
"We next add the graphs associated with RNA and CNA. (See the documentation for more details and the expected edgelist format.)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "690e8e4b-6b5c-4f06-9d33-4eaae69d43c5",
"metadata": {},
"outputs": [],
"source": [
"keeper.load_graph(E_RNA_FNAME, label='rna')\n",
"keeper.load_graph(E_CNA_FNAME, label='cna')\n",
"\n",
"# uncomment to add methylation graph:\n",
"# keeper.load_graph(E_METH_FNAME, label='meth')"
]
},
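{
"cell_type": "markdown",
"id": "sketch-edgelist-format",
"metadata": {},
"source": [
"The example edgelist files have one edge per row with `source` and `target` columns, as displayed earlier in this tutorial. The sketch below builds a graph from a table of that shape with `networkx` and keeps the largest connected component, mirroring the restriction described for the example networks. The edges are toy values and the variable names are illustrative; this does not show how `load_graph` works internally.\n",
"\n",
"```python\n",
"import networkx as nx\n",
"import pandas as pd\n",
"\n",
"# Toy edgelist with the same column names as the example files.\n",
"edges = pd.DataFrame({'source': ['ABCB8', 'SAT1', 'TIMM13'],\n",
"                      'target': ['SAT1', 'APLP1', 'TOMM40']})\n",
"\n",
"# from_pandas_edgelist reads the 'source' and 'target' columns by default.\n",
"G_toy = nx.from_pandas_edgelist(edges)\n",
"\n",
"# Keep only the largest connected component.\n",
"largest = max(nx.connected_components(G_toy), key=len)\n",
"G_lcc = G_toy.subgraph(largest).copy()\n",
"print(G_lcc.number_of_nodes(), G_lcc.number_of_edges())\n",
"```\n",
"\n",
"In this toy graph the component containing `SAT1` has 3 nodes and 2 edges, so the `TIMM13`/`TOMM40` pair is dropped."
]
},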
{
"cell_type": "markdown",
"id": "4be7843b-3e74-4590-be6e-ab9f7b735592",
"metadata": {},
"source": [
"Similarities and distances can be loaded from file into the `Keeper` in the same manner via:\n",
"\n",
"- `keeper.load_distance()` or `keeper.load_stacked_distance()`\n",
"- `keeper.load_similarity()` or `keeper.load_stacked_similarity()`\n",
"\n",
"See the documentation for more details."
]
},
{
"cell_type": "markdown",
"id": "070f5941-f984-41cc-9cc4-a38c7e18a054",
"metadata": {},
"source": [
"# Extract data from the Keeper"
]
},
{
"cell_type": "markdown",
"id": "a0299f1a-af22-44fa-bb6b-d6a047948074",
"metadata": {},
"source": [
"We can extract the data in the form of a `pandas.DataFrame` as follows:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "a2ea3efb-1096-42c7-8204-81455cf9162a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" TCGA-E9-A295-01 TCGA-AR-A1AS-01 TCGA-AQ-A1H2-01 TCGA-A8-A08O-01 \\\n",
"ABCB8 910.2982 624.0432 422.2278 659.3773 \n",
"SAT1 2727.0313 2731.2022 3701.4551 2259.7105 \n",
"ABCF3 1310.1533 1381.3597 1413.9488 1078.8295 \n",
"\n",
" TCGA-BH-A1FJ-01 TCGA-JL-A3YX-01 TCGA-A7-A425-01 TCGA-AC-A2BM-01 \\\n",
"ABCB8 836.3215 1202.2298 1264.0832 759.0133 \n",
"SAT1 3085.1455 3915.4537 3230.3552 2042.0907 \n",
"ABCF3 992.8420 1273.1496 1344.0976 1256.8570 \n",
"\n",
" TCGA-LL-A6FP-01 TCGA-A7-A26E-01 ... TCGA-A2-A1G0-01 \\\n",
"ABCB8 2768.8172 560.3929 ... 840.2367 \n",
"SAT1 3258.4005 4837.8020 ... 1728.7968 \n",
"ABCF3 2183.8038 1067.4277 ... 1055.8843 \n",
"\n",
" TCGA-WT-AB41-01 TCGA-EW-A1P6-01 TCGA-XX-A89A-01 TCGA-A7-A4SD-01 \\\n",
"ABCB8 1074.9211 769.4139 922.0595 597.0149 \n",
"SAT1 6252.3659 2843.2332 7392.0642 3602.4876 \n",
"ABCF3 2185.3312 1145.1250 1031.6486 1705.9701 \n",
"\n",
" TCGA-AC-A6IX-01 TCGA-AR-A24L-01 TCGA-BH-A42U-01 TCGA-AR-A24S-01 \\\n",
"ABCB8 863.0025 451.0316 795.7087 606.1712 \n",
"SAT1 3878.9536 2552.8859 5193.3684 5891.7026 \n",
"ABCF3 1292.1338 924.5234 1223.4741 1003.6413 \n",
"\n",
" TCGA-BH-A0BC-01 \n",
"ABCB8 950.1840 \n",
"SAT1 3594.8176 \n",
"ABCF3 1278.0836 \n",
"\n",
"[3 rows x 606 columns]"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = keeper.data['rna'].to_frame()\n",
"df.head(3)"
]
},
{
"cell_type": "markdown",
"id": "f9a516e9-e9b4-4fe5-ae04-731fbc8fe587",
"metadata": {},
"source": [
"Similarly, this can be done for any key in the distance or similarity keeper in the form:\n",
"\n",
"- `keeper.distances[key].to_frame()`\n",
"- `keeper.similarities[key].to_frame()`"
]
},
{
"cell_type": "markdown",
"id": "98c383f1-5bbe-47b6-9ea6-4b51d351d2ed",
"metadata": {},
"source": [
"Next, we demonstrate how to get a graph that has been stored in `keeper`, keyed by its reference label:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "6ef786bd-d337-4677-aa4a-5df003e77d6c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Graph named 'rna' with 7148 nodes and 27498 edges\n"
]
}
],
"source": [
"G = keeper.graphs['rna']\n",
"print(G)"
]
},
{
"cell_type": "markdown",
"id": "4e974df7-deb9-46cc-9c15-e65132a927e6",
"metadata": {},
"source": [
"Similarly, we can access miscellaneous data stored in `keeper`:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "27e0543e-7e0e-41bb-8f33-737d2e873e04",
"metadata": {},
"outputs": [],
"source": [
"# first we add the gene list to the keeper:\n",
"keeper.add_misc(gene_list, 'my_gene_list')"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "7ccf5ce6-0e32-4d16-82f0-6e19b1d57825",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['SLC3A1',\n",
" 'TIMM13',\n",
" 'STYX',\n",
" 'MUC6',\n",
" 'TIMM17B',\n",
" 'TNFSF18',\n",
" 'TOMM7',\n",
" 'TOMM40',\n",
" 'TRAPPC2L',\n",
" 'AZU1',\n",
" 'TUBB1',\n",
" 'XCR1',\n",
" 'VPS33B',\n",
" 'XCL2',\n",
" 'KNTC1']"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# and to access the gene list:\n",
"keeper.misc['my_gene_list']"
]
},
{
"cell_type": "markdown",
"id": "81e7ffd6-32ad-41d3-b374-78a90a3953f2",
"metadata": {},
"source": [
"# Save data from the Keeper"
]
},
{
"cell_type": "markdown",
"id": "5b074dc9-2937-4037-a4ef-675ccf0d3b03",
"metadata": {},
"source": [
"Data can be saved from:\n",
"- `keeper.data` via `keeper.save_data()`\n",
"- `keeper.distances` via `keeper.save_distance()`\n",
"- `keeper.similarities` via `keeper.save_similarity()`\n",
"- `keeper.misc` (if it is a `pandas.DataFrame`) via `keeper.save_misc()`\n",
"\n",
"Currently, data is saved using `pandas.DataFrame.to_csv`. Future releases will provide additional formats for saving data.\n",
"\n",
"In order to save data, `keeper` must have the attribute `outdir` defined.\n",
"\n",
"The data is then saved to a file named `{outdir}/{data_type}_{label}.{file_format}` where\n",
"- `outdir` : The keeper's output directory: `keeper.outdir`.\n",
"- `data_type` : This is one of {'data', 'distance', 'similarity', 'misc'}, depending on which method is called to save the data.\n",
"- `label` : This is the reference label, specified by the user, for the data in the keeper that should be saved.\n",
"- `file_format` : This is the file extension, provided by the user (default = 'txt'). \n",
"\n",
"See the documentation for more details."
]
},
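{
"cell_type": "markdown",
"id": "sketch-save-path-pattern",
"metadata": {},
"source": [
"The naming pattern can be reproduced with `pathlib` to predict where a file will be written. The helper `expected_path` is hypothetical: it only formats the pattern stated above and does not call `netflow`.\n",
"\n",
"```python\n",
"import pathlib\n",
"\n",
"def expected_path(outdir, data_type, label, file_format='txt'):\n",
"    # Hypothetical helper: formats {outdir}/{data_type}_{label}.{file_format}.\n",
"    return pathlib.Path(outdir) / f'{data_type}_{label}.{file_format}'\n",
"\n",
"path = expected_path('results_netflow_breast_tcga', 'data', 'rna')\n",
"print(path.as_posix())\n",
"```\n",
"\n",
"With the default extension, the RNA feature data would be expected at `results_netflow_breast_tcga/data_rna.txt`."
]
},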
{
"cell_type": "markdown",
"id": "2f025109-eb53-4627-89ec-e1d3746ede49",
"metadata": {},
"source": [
"For example, to save the RNA data to the specified output directory, you would call:\n",
"\n",
"`keeper.save_data('rna')`"
]
},
{
"cell_type": "markdown",
"id": "3c4b2b00-9abb-427c-a861-b5f1bd7d2203",
"metadata": {},
"source": [
"# Extract Keeper subset"
]
},
{
"cell_type": "markdown",
"id": "acc4703c-2619-4acb-964a-8473c02f11b6",
"metadata": {},
"source": [
"You can create a new `Keeper` instance with a subset of the observations from `keeper`. \n",
"\n",
"Caution should be taken with the resulting `misc` and `graphs` keepers, as they are maintained independently of the observations. You can choose whether the misc data and graphs are copied, as is, into the new subset `Keeper`. "
]
},
{
"cell_type": "markdown",
"id": "a0404eb6-222d-49bb-91fa-9f40209bae04",
"metadata": {},
"source": [
"You can also specify the output directory `outdir` for the new keeper subset. Default is `None`."
]
},
{
"cell_type": "markdown",
"id": "5e8758aa-e2ff-4f19-9c56-263aed2337da",
"metadata": {},
"source": [
"For example, we can extract a subset of `keeper` with the first 10 observations, using `keep_misc=True` and `keep_graphs=True` to keep a copy of the miscellaneous data and the graphs. Additionally, we use `outdir` to specify the output directory we want for this subset of observations."
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "54c86564-a8d3-48b5-8539-65c9dae2ecc2",
"metadata": {},
"outputs": [],
"source": [
"outdir_sub = OUT_DIR.parent / (OUT_DIR.name + '_sub')\n",
"\n",
"keeper_sub = keeper.subset(keeper.observation_labels[:10], \n",
" keep_misc=True, keep_graphs=True,\n",
" outdir=outdir_sub)"
]
},
{
"cell_type": "markdown",
"id": "50a3f266-e1cd-4363-955c-81973d0394f1",
"metadata": {},
"source": [
"# Extract data subset"
]
},
{
"cell_type": "markdown",
"id": "ebdf1642-a9a2-41b6-be62-75b530db3f2f",
"metadata": {},
"source": [
"You can extract a subset of observations and features for any data stored in the data-keeper as a `pandas.DataFrame`: "
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "c0572c60-8bcb-4259-8c78-69dac3ec4b7d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" TCGA-E9-A295-01 TCGA-AR-A1AS-01 TCGA-AQ-A1H2-01\n",
"SLC3A1 4.8292 0.0000 3.1360\n",
"TIMM13 1078.3533 992.7960 708.2288\n",
"STYX 829.6511 542.0982 403.1611\n",
"MUC6 1.4488 130.1216 0.0000\n",
"TIMM17B 827.7194 573.1652 739.5886"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rna_sub = keeper.data['rna'].subset(observations=keeper.observation_labels[:3],\n",
" features=gene_list[:5])\n",
"rna_sub"
]
},
{
"cell_type": "markdown",
"id": "0275c9f0-b067-45b1-aaad-637e414c43ab",
"metadata": {},
"source": [
"Similarly, you can extract distances (or similarities) between a subset of observations as a `pandas.DataFrame`. We'll first add a distance to the keeper to demonstrate this:"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "6015db82-ec4c-4a62-a194-cd691c772d08",
"metadata": {},
"outputs": [],
"source": [
"from scipy.spatial.distance import cdist"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "e8741860-5716-4999-90dc-8f1964ec4c46",
"metadata": {},
"outputs": [],
"source": [
"rna_euc = pd.DataFrame(data=cdist(rna.T, rna.T, metric='euclidean'),\n",
" index=rna.columns.copy(), columns=rna.columns.copy())\n",
"keeper.add_distance(rna_euc, 'rna_euc')"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "4e5acca6-63c7-4094-81cb-5ba27cd4edce",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TCGA-E9-A295-01 | \n",
" TCGA-AR-A1AS-01 | \n",
" TCGA-AQ-A1H2-01 | \n",
"
\n",
" \n",
" \n",
" \n",
" TCGA-E9-A295-01 | \n",
" 0.000000 | \n",
" 351149.925916 | \n",
" 392258.653370 | \n",
"
\n",
" \n",
" TCGA-AR-A1AS-01 | \n",
" 351149.925916 | \n",
" 0.000000 | \n",
" 382107.888105 | \n",
"
\n",
" \n",
" TCGA-AQ-A1H2-01 | \n",
" 392258.653370 | \n",
" 382107.888105 | \n",
" 0.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TCGA-E9-A295-01 TCGA-AR-A1AS-01 TCGA-AQ-A1H2-01\n",
"TCGA-E9-A295-01 0.000000 351149.925916 392258.653370\n",
"TCGA-AR-A1AS-01 351149.925916 0.000000 382107.888105\n",
"TCGA-AQ-A1H2-01 392258.653370 382107.888105 0.000000"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"keeper.distances['rna_euc'].subset(keeper.observation_labels[:3])"
]
},
{
"cell_type": "markdown",
"id": "c1d7957c-5e47-4b5b-a872-e98ed44deb2a",
"metadata": {},
"source": [
"The same can be done for a similarity stored in `keeper` using `keeper.similarities` in place of `keeper.distances`."
]
},
{
"cell_type": "markdown",
"id": "812e6da6-c8f9-43f0-a51e-8a673e13bebc",
"metadata": {},
"source": [
"# Iterating through feature data, distances or similarities"
]
},
{
"cell_type": "markdown",
"id": "001757b6-2ef3-4a24-9052-c329007b3623",
"metadata": {},
"source": [
"Before demonstrating how to iterate through the data, we review some of the `Keeper` class properties in a bit more detail: \n",
"\n",
"When we instatiate a `Keeper` object: `keeper = netflow.Keeper()`, \n",
"the `keeper` is initialized with an instance of the `DataKeeper` class assigned as `keeper.data` and two instances of the `DistanceKeeper` class assigned as `keeper.distances` and `keeper.similarities`. Data is added to the `DataKeeper` and `DistanceKeeper` with a reference key which may then be accessed in the same manner as retrieving a value from a `dict`. However, the `DataKeeper` and `DistanceKeeper` maintain a bit more information than just the data itself to regulate observation (and feature) properties. Therefore, keyed-accessing data from `DataKeeper` or `DistanceKeeper` returns an instance of `DataView` or `DistanceView`, respectively (instead of the original input data). \n",
"\n",
"For example: `x = keeper.data['rna']` \n",
"Here, `x` is an instance of the `DataView` class. \n",
"The RNA data itself (as a `numpy.ndarray`) can be accessed via `x.data`. \n",
"And previously, we demonstrated how to extract the data as a `pandas.DataFrame`: `x.to_frame()`. (The `.to_frame()` property is actually a method of the `DataView` (and `DistanceView`) class.)\n",
"\n",
"You can iterate over the data stored in a `DataKeeper` and `DistanceKeeper`, which yields an instance of `DataView` and `DistanceView`, repectively, at each iteration.\n",
"\n",
"Therefore, you can iterate over the feature data, distances and similarities. The process is the same for each type of store data, so we demonstrate this on the feature data."
]
},
{
"cell_type": "markdown",
"id": "ea16b1b6-96f0-491e-aac8-3b246641e3c2",
"metadata": {},
"source": [
"Here we iterate through all the feature data and print the keyed-label:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "2f983bc5-bf6d-4b10-a28c-941bdaf95555",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"rna\n",
"cna\n"
]
}
],
"source": [
"for dd in keeper.data:\n",
" print(dd.label)"
]
},
{
"cell_type": "markdown",
"id": "adf336a9-e8a5-4303-868e-c57b0f6fe2e8",
"metadata": {},
"source": [
"Here we iterate through all the feature data and print out the number of features:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "ed462bf2-2c2f-43c1-9b8f-b0a40694d133",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"7148\n",
"8763\n"
]
}
],
"source": [
"for dd in keeper.data:\n",
" print(dd.num_features)"
]
},
{
"cell_type": "markdown",
"id": "eb4024be-baa7-4d1d-ac12-ad4da4063350",
"metadata": {},
"source": [
"Here we iterate through all the feature data and print out the first 4 feature labels:"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "804ef414-5812-4f36-a29b-e8e39fc6a450",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['ABCB8', 'SAT1', 'ABCF3', 'ARF1']\n",
"['ABCB8', 'SAT1', 'ABCF3', 'ARF1']\n"
]
}
],
"source": [
"for dd in keeper.data:\n",
" print(dd.feature_labels[:4])"
]
},
{
"cell_type": "markdown",
"id": "a6392e1c-69e0-4430-ad37-f122db11c884",
"metadata": {},
"source": [
"Here we iterate through all the feature data and print out the stored data for the first 3 observations and first 4 features:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "236c58fe-0115-4175-8176-d280ef2dabe2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 910.2982 624.0432 422.2278]\n",
" [ 2727.0313 2731.2022 3701.4551]\n",
" [ 1310.1533 1381.3597 1413.9488]\n",
" [15602.5595 17668.1675 10149.0216]]\n",
"[[2. 2. 1.]\n",
" [2. 2. 1.]\n",
" [2. 3. 2.]\n",
" [3. 3. 2.]]\n"
]
}
],
"source": [
"for dd in keeper.data:\n",
" print(dd.data[:4, :3])"
]
},
{
"cell_type": "markdown",
"id": "92e71373-0acf-4435-a16c-b4d55b1e426c",
"metadata": {},
"source": [
"You can also use the `key` attribute to see the labels of data stored in any of the keepers:"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "295cca36-756b-44fe-81fa-993ab14ce417",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dict_keys(['rna', 'cna'])"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"keeper.data.keys()"
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "071fdc62-6913-48db-a2fb-6006284bc07b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dict_keys(['rna_euc'])"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"keeper.distances.keys()"
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "4c2a18a6-69c9-412d-993d-9a5500fab11f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dict_keys(['rna', 'cna'])"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"keeper.graphs.keys()"
]
},
{
"cell_type": "markdown",
"id": "c5200047-9b87-467e-9483-2adcd0f35c27",
"metadata": {},
"source": [
"# Membership"
]
},
{
"cell_type": "markdown",
"id": "5b3b96c8-9f68-456a-b952-11aede03f789",
"metadata": {},
"source": [
"You can check if data is in `keeper.data`, `keeper.distances`, or `keeper.similarities` by its key.\n",
"\n",
"For example, we next check if the RNA feature data is in `keeper.data`You can check if data is in `keeper.data`, `keeper.distances`, or `keeper.similarities` by its key.\n",
"\n",
"For example, we next check if the RNA feature data is in `keeper.data`"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "9dbe2643-9864-4e9f-973d-ec0a147a7996",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"'rna' in keeper.data"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:geo_env_test] *",
"language": "python",
"name": "conda-env-geo_env_test-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}