{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "29989051-ea7c-4d60-a442-fb06c4424235",
   "metadata": {},
   "source": [
    "# Keeper tutorial"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "539af752-a16b-43a1-8215-c39c17e216bd",
   "metadata": {},
   "source": [
    "The primary data structure used in `netflow` is called a `Keeper`. It is used to load, store, manipulate and save data for a set of observations. In particular, there are several specific types of keepers:\n",
    "\n",
    "- `DataKeeper` : handles feature data\n",
    "- `DistanceKeeper` : handles pairwise-observation distances (also used to handle pairwise observation similarities)\n",
    "- `GraphKeeper` : handles graphs (networks)\n",
    "\n",
    "Interacting with `netflow` will primarily entail making use of the predomenent `Keeper`, which implicitly makes use of the aforementioned specific keeper classes. This tutorial therefore focuses on the `Keeper` class, please see the documentation for more detail on the other Keeper classes. \n",
    "\n",
    "Data is organized in the `Keeper` class via the following attributes:\n",
    "\n",
    "- `self.oudir` : (directory path) : Path to directory where results will be saved.\n",
    "    - If not provided, no results can be saved.\n",
    "- `self.observation_labels` : (`list`) Observation labels are kept consistent across all feature data, distances and similarities.\n",
    "- `self.data` : (`DataKeeper`) Used to handle all feature data.\n",
    "- `self.distances` : (`DistanceKeeper`) Used to handle all observation-pairwise distances.\n",
    "- `self.similarities` : (`DistanceKeeper`) Used to handle all observation-pairwise similarities.\n",
    "- `self.graphs` : (`GraphKeeper`) Used to handle all graphs.\n",
    "- `self.misc` : (`dict`) Used to handle any miscellaneous data. \n",
    "    - Caution should be taken as observation labels and/or ordering of data stored in `self.misc` may not be consistent with the observations as tracked by the `Keeper`.\n",
    "\n",
    "We will now walk through some use-cases of how to make use of the `Keeper` class."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28099729-bd14-4b3a-9c08-f029482d5df7",
   "metadata": {},
   "source": [
    "First, import the necessary packages:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f994f736-ca88-43d8-bd70-44005fd29592",
   "metadata": {},
   "source": [
    "# Load libraries "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "379c16e0-418e-46c3-85cf-5ee3b52f58cf",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pathlib\n",
    "import sys\n",
    "\n",
    "from collections import defaultdict as ddict\n",
    "import itertools\n",
    "import matplotlib.pyplot as plt\n",
    "import networkx as nx\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import scipy.sparse as sc_sparse\n",
    "from tqdm import tqdm"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a8c92cca-8e3a-4f6f-b07a-6bc7a9caf5d4",
   "metadata": {},
   "source": [
    "If ``netflow`` has not been installed, add the path to the library:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "3967f36c-d33a-4fa0-9060-059f01982074",
   "metadata": {},
   "outputs": [],
   "source": [
    "sys.path.insert(0, pathlib.Path(pathlib.Path('.').absolute()).parents[3].resolve().as_posix())\n",
    "# sys.path.insert(0, pathlib.Path(pathlib.Path('.').absolute()).parents[0].resolve().as_posix())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29c246ce-d1dc-4392-8686-62d77a9c9295",
   "metadata": {},
   "source": [
    "From the ``netflow`` package, we load the following modules:\n",
    " - The ``InfoNet`` class is used to compute 1-hop neighborhood distances\n",
    " - The ``Keeper`` class is used to store and manipulate data/results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "5b9f724f-13fa-4cab-bcc2-33b2698b1caf",
   "metadata": {},
   "outputs": [],
   "source": [
    "import netflow as nf\n",
    "\n",
    "# from netflow.keepers import keeper "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c377c3dc-d66d-417d-9da5-3e41e5e9882d",
   "metadata": {},
   "source": [
    "# Set up directories"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "672bd1c1-7153-4db3-b7a3-3e686ce76c9a",
   "metadata": {},
   "outputs": [],
   "source": [
    "MAIN_DIR = pathlib.Path('.').absolute()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ec4a2415-0a6f-481f-bc2c-14cc4b23f210",
   "metadata": {},
   "source": [
    "Paths to where data is stored:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "e86e42a8-fcf3-4732-b93e-343dd1cbb38e",
   "metadata": {},
   "outputs": [],
   "source": [
    "DATA_DIR = MAIN_DIR / 'example_data' / 'breast_tcga'\n",
    "\n",
    "RNA_FNAME = DATA_DIR / 'rna_606.txt'\n",
    "E_RNA_FNAME = DATA_DIR / 'edgelist_hprd_rna_606.txt'\n",
    "\n",
    "CNA_FNAME = DATA_DIR / 'cna_606.txt'\n",
    "E_CNA_FNAME = DATA_DIR / 'edgelist_hprd_cna_606.txt'\n",
    "\n",
    "METH_FNAME = DATA_DIR / 'methylation_606.txt'\n",
    "E_METH_FNAME = DATA_DIR / 'edgelist_hprd_methylation_606.txt'\n",
    "\n",
    "CLIN_FNAME = DATA_DIR / 'clin_606.txt'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5aa0e88c-1d2f-410c-9af3-8f311c6fc482",
   "metadata": {},
   "source": [
    "Directory where output should be saved:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "57c62445-d73b-4225-aa08-71aa81f16985",
   "metadata": {},
   "outputs": [],
   "source": [
    "OUT_DIR = MAIN_DIR / 'example_data' / 'results_netflow_breast_tcga'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b383f2a-5a4e-466d-804a-d63b234cb552",
   "metadata": {},
   "source": [
    "# Load data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4793b893-a186-492e-b985-6b07ec163ef1",
   "metadata": {},
   "source": [
    "We first load example TCGA breast cancer data that will be used to demonstrate how data can be loaded into the `Keeper`.\n",
    "\n",
    "Sample inclusion criteria (n=606):\n",
    "- Restricted to \n",
    "    - Primary samples\n",
    "    - Cancer type: Breast Cancer\n",
    "    - Detailed cancer type: Breast Invasive Ductal Carcinoma (IDC) or Breast Invasive Lobular Carcinoma (ILC)\n",
    "- Has reported overall survival status\n",
    "- Overall survival > 3.5 months (Figure 1)    \n",
    "    <!-- - <img src=\"attachment:6fc3253c-59fc-4cd5-b9d3-e407edd6a484.png\" alt=\"OS histogram\" width=\"200\"/> -->\n",
    "    <!-- - <img src=\"BC_overall_survival_hist.png\" alt=\"OS historgram\" width=\"200\"/> -->\n",
    "    ![OS histogram](BC_overall_survival_hist.png)\n",
    "\n",
    "    - distribution of age of patients excluded is similar to distribution of the age of all samples (Figure 2)\n",
    "    \n",
    "    ![age_histogram](BC_age_histogram.png)\n",
    "    <!-- <img src=\"BC_age_histogram.png\" alt=\"age histogram\" width=\"200\"/> -->\n",
    "    <!-- <img src=\"attachment:822ef528-aa96-48a6-80c5-12710e35b110.png\" alt=\"age histogram\" width=\"200\"/> -->\n",
    "- Has reported PAM50 subtype\n",
    "- Has RNA, CNA and methylation data\n",
    "\n",
    "- Feature data:\n",
    "    - RNA (7,148 genes)\n",
    "        - Remove genes that have zero expression in at least 20% (=121) of samples\n",
    "        - When multiple Entrez IDs map to the same gene, use the sum\n",
    "        - Restrict to largest connected component of genes in HPRD\n",
    "    - CNA (8,763 genes)\n",
    "        - When multiple Entrez IDs map to the same gene, select the more extreme value (i.e., in absolute value). If the absolute values are the same, select the loss.\n",
    "        - Restrict to largest connected component of genes in HPRD\n",
    "        - Translate from [-2, 2] -> [0, 4]\n",
    "    - methylation (7,969 genes)        \n",
    "        - Missing rate is at most 5.12% - used nearest-neighbor imputation computed in Python using the scikit-learn `KNNImputer` with default parameters except `weights=“distance”`.\n",
    "        - When multiple Entrez IDs map to the same gene, select the max.\n",
    "        - Restrict to largest connected component of genes in HPRD\n",
    "- Network edgelists (derived from HPRD):\n",
    "    - RNA edgelist (7,148 nodes and 27,498 edges)\n",
    "    - CNA edgelist (8,763 nodes and 34,906 edges)\n",
    "    - methylation edgelist (7,969 nodes and 31,112 edges)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "5b2b69ec-a949-4b2f-99c9-36fbf01a5ba5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(7148, 606)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>TCGA-E9-A295-01</th>\n",
       "      <th>TCGA-AR-A1AS-01</th>\n",
       "      <th>TCGA-AQ-A1H2-01</th>\n",
       "      <th>TCGA-A8-A08O-01</th>\n",
       "      <th>TCGA-BH-A1FJ-01</th>\n",
       "      <th>TCGA-JL-A3YX-01</th>\n",
       "      <th>TCGA-A7-A425-01</th>\n",
       "      <th>TCGA-AC-A2BM-01</th>\n",
       "      <th>TCGA-LL-A6FP-01</th>\n",
       "      <th>TCGA-A7-A26E-01</th>\n",
       "      <th>...</th>\n",
       "      <th>TCGA-A2-A1G0-01</th>\n",
       "      <th>TCGA-WT-AB41-01</th>\n",
       "      <th>TCGA-EW-A1P6-01</th>\n",
       "      <th>TCGA-XX-A89A-01</th>\n",
       "      <th>TCGA-A7-A4SD-01</th>\n",
       "      <th>TCGA-AC-A6IX-01</th>\n",
       "      <th>TCGA-AR-A24L-01</th>\n",
       "      <th>TCGA-BH-A42U-01</th>\n",
       "      <th>TCGA-AR-A24S-01</th>\n",
       "      <th>TCGA-BH-A0BC-01</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>ABCB8</th>\n",
       "      <td>910.2982</td>\n",
       "      <td>624.0432</td>\n",
       "      <td>422.2278</td>\n",
       "      <td>659.3773</td>\n",
       "      <td>836.3215</td>\n",
       "      <td>1202.2298</td>\n",
       "      <td>1264.0832</td>\n",
       "      <td>759.0133</td>\n",
       "      <td>2768.8172</td>\n",
       "      <td>560.3929</td>\n",
       "      <td>...</td>\n",
       "      <td>840.2367</td>\n",
       "      <td>1074.9211</td>\n",
       "      <td>769.4139</td>\n",
       "      <td>922.0595</td>\n",
       "      <td>597.0149</td>\n",
       "      <td>863.0025</td>\n",
       "      <td>451.0316</td>\n",
       "      <td>795.7087</td>\n",
       "      <td>606.1712</td>\n",
       "      <td>950.1840</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>SAT1</th>\n",
       "      <td>2727.0313</td>\n",
       "      <td>2731.2022</td>\n",
       "      <td>3701.4551</td>\n",
       "      <td>2259.7105</td>\n",
       "      <td>3085.1455</td>\n",
       "      <td>3915.4537</td>\n",
       "      <td>3230.3552</td>\n",
       "      <td>2042.0907</td>\n",
       "      <td>3258.4005</td>\n",
       "      <td>4837.8020</td>\n",
       "      <td>...</td>\n",
       "      <td>1728.7968</td>\n",
       "      <td>6252.3659</td>\n",
       "      <td>2843.2332</td>\n",
       "      <td>7392.0642</td>\n",
       "      <td>3602.4876</td>\n",
       "      <td>3878.9536</td>\n",
       "      <td>2552.8859</td>\n",
       "      <td>5193.3684</td>\n",
       "      <td>5891.7026</td>\n",
       "      <td>3594.8176</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2 rows × 606 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       TCGA-E9-A295-01  TCGA-AR-A1AS-01  TCGA-AQ-A1H2-01  TCGA-A8-A08O-01  \\\n",
       "ABCB8         910.2982         624.0432         422.2278         659.3773   \n",
       "SAT1         2727.0313        2731.2022        3701.4551        2259.7105   \n",
       "\n",
       "       TCGA-BH-A1FJ-01  TCGA-JL-A3YX-01  TCGA-A7-A425-01  TCGA-AC-A2BM-01  \\\n",
       "ABCB8         836.3215        1202.2298        1264.0832         759.0133   \n",
       "SAT1         3085.1455        3915.4537        3230.3552        2042.0907   \n",
       "\n",
       "       TCGA-LL-A6FP-01  TCGA-A7-A26E-01  ...  TCGA-A2-A1G0-01  \\\n",
       "ABCB8        2768.8172         560.3929  ...         840.2367   \n",
       "SAT1         3258.4005        4837.8020  ...        1728.7968   \n",
       "\n",
       "       TCGA-WT-AB41-01  TCGA-EW-A1P6-01  TCGA-XX-A89A-01  TCGA-A7-A4SD-01  \\\n",
       "ABCB8        1074.9211         769.4139         922.0595         597.0149   \n",
       "SAT1         6252.3659        2843.2332        7392.0642        3602.4876   \n",
       "\n",
       "       TCGA-AC-A6IX-01  TCGA-AR-A24L-01  TCGA-BH-A42U-01  TCGA-AR-A24S-01  \\\n",
       "ABCB8         863.0025         451.0316         795.7087         606.1712   \n",
       "SAT1         3878.9536        2552.8859        5193.3684        5891.7026   \n",
       "\n",
       "       TCGA-BH-A0BC-01  \n",
       "ABCB8         950.1840  \n",
       "SAT1         3594.8176  \n",
       "\n",
       "[2 rows x 606 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "rna = pd.read_csv(RNA_FNAME, header=0, index_col=0)\n",
    "print(rna.shape)\n",
    "\n",
    "display(rna.head(2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "84f5b6ec-38dd-42f6-96e3-9d5bdca220b6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>source</th>\n",
       "      <th>target</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ABCB8</td>\n",
       "      <td>SAT1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>SAT1</td>\n",
       "      <td>APLP1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  source target\n",
       "0  ABCB8   SAT1\n",
       "1   SAT1  APLP1"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Graph named 'rna' with 7148 nodes and 27498 edges\n"
     ]
    }
   ],
   "source": [
    "E_rna = pd.read_csv(E_RNA_FNAME, header=0)\n",
    "display(E_rna.head(2))\n",
    "\n",
    "G_rna = nx.from_pandas_edgelist(E_rna)\n",
    "G_rna.name = 'rna'\n",
    "print(G_rna)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "3077cc48-c8a9-4a53-b669-08f0d9497c4a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(8763, 606)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>TCGA-E9-A295-01</th>\n",
       "      <th>TCGA-AR-A1AS-01</th>\n",
       "      <th>TCGA-AQ-A1H2-01</th>\n",
       "      <th>TCGA-A8-A08O-01</th>\n",
       "      <th>TCGA-BH-A1FJ-01</th>\n",
       "      <th>TCGA-JL-A3YX-01</th>\n",
       "      <th>TCGA-A7-A425-01</th>\n",
       "      <th>TCGA-AC-A2BM-01</th>\n",
       "      <th>TCGA-LL-A6FP-01</th>\n",
       "      <th>TCGA-A7-A26E-01</th>\n",
       "      <th>...</th>\n",
       "      <th>TCGA-A2-A1G0-01</th>\n",
       "      <th>TCGA-WT-AB41-01</th>\n",
       "      <th>TCGA-EW-A1P6-01</th>\n",
       "      <th>TCGA-XX-A89A-01</th>\n",
       "      <th>TCGA-A7-A4SD-01</th>\n",
       "      <th>TCGA-AC-A6IX-01</th>\n",
       "      <th>TCGA-AR-A24L-01</th>\n",
       "      <th>TCGA-BH-A42U-01</th>\n",
       "      <th>TCGA-AR-A24S-01</th>\n",
       "      <th>TCGA-BH-A0BC-01</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>ABCB8</th>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>...</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>SAT1</th>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>...</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2 rows × 606 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       TCGA-E9-A295-01  TCGA-AR-A1AS-01  TCGA-AQ-A1H2-01  TCGA-A8-A08O-01  \\\n",
       "ABCB8                2                2                1                2   \n",
       "SAT1                 2                2                1                1   \n",
       "\n",
       "       TCGA-BH-A1FJ-01  TCGA-JL-A3YX-01  TCGA-A7-A425-01  TCGA-AC-A2BM-01  \\\n",
       "ABCB8                2                2                3                2   \n",
       "SAT1                 2                1                2                1   \n",
       "\n",
       "       TCGA-LL-A6FP-01  TCGA-A7-A26E-01  ...  TCGA-A2-A1G0-01  \\\n",
       "ABCB8                3                2  ...                3   \n",
       "SAT1                 2                2  ...                3   \n",
       "\n",
       "       TCGA-WT-AB41-01  TCGA-EW-A1P6-01  TCGA-XX-A89A-01  TCGA-A7-A4SD-01  \\\n",
       "ABCB8                2                2                2                1   \n",
       "SAT1                 2                2                1                1   \n",
       "\n",
       "       TCGA-AC-A6IX-01  TCGA-AR-A24L-01  TCGA-BH-A42U-01  TCGA-AR-A24S-01  \\\n",
       "ABCB8                2                2                2                2   \n",
       "SAT1                 2                2                2                2   \n",
       "\n",
       "       TCGA-BH-A0BC-01  \n",
       "ABCB8                2  \n",
       "SAT1                 2  \n",
       "\n",
       "[2 rows x 606 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "cna = pd.read_csv(CNA_FNAME, header=0, index_col=0)\n",
    "print(cna.shape)\n",
    "\n",
    "display(cna.head(2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "8d8a7d58-c052-4e96-a7fb-287c2a2018d3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>source</th>\n",
       "      <th>target</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ABCB8</td>\n",
       "      <td>SAT1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>SAT1</td>\n",
       "      <td>APLP1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  source target\n",
       "0  ABCB8   SAT1\n",
       "1   SAT1  APLP1"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Graph named 'cna' with 8763 nodes and 34906 edges\n"
     ]
    }
   ],
   "source": [
    "E_cna = pd.read_csv(E_CNA_FNAME, header=0)\n",
    "display(E_cna.head(2))\n",
    "\n",
    "G_cna = nx.from_pandas_edgelist(E_cna)\n",
    "G_cna.name = 'cna'\n",
    "print(G_cna)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17bbb5a1-77c8-44c1-9681-ce7df1fc21a5",
   "metadata": {},
   "source": [
    "# Initialize the Keeper"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b6c0d38c-8727-46f6-9fbc-df1978252aaa",
   "metadata": {},
   "source": [
    "The `Keeper` can be instatiated with or without `outdir` - an output directory:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "fad5a3fc-4c89-4494-afc1-364f7e0864d5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# uncomment to initialize Keeper with no output directory:\n",
    "# keeper = nf.Keeper() \n",
    "\n",
    "# initializing Keeper with output directory:\n",
    "keeper = nf.Keeper(outdir=OUT_DIR)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "419f6157-29f3-44ef-8523-c44fae08848b",
   "metadata": {},
   "source": [
    "See the documentation for more details on initializing the `Keeper`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f281c332-33cd-423a-9b04-1436b3b87a53",
   "metadata": {},
   "source": [
    "# Load data into the Keeper"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13375e59-4a54-4a9b-9796-ce729dd1be63",
   "metadata": {},
   "source": [
    "Currently, data is expected to be either in the form of a `numpy.ndarray` or `pandas.DataFrame`, or saved in a file that is loadable via `pandas.read_csv`, with observations as columns and features as rows."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f3f5b567-63dc-461f-babc-63a8d5c51fc2",
   "metadata": {},
   "source": [
    "Data can be loaded into the Keeper in two ways: \n",
    "    i. From a `numpy.ndarray` or `pandas.DataFrame` or\n",
    "    ii. Directly from a file that is loadable via `pandas.read_csv`\n",
    "    \n",
    "Note: Future releases will handle various file types."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "112296ad-452d-4180-bc1c-2eecd3086c6d",
   "metadata": {},
   "outputs": [],
   "source": [
    "keeper.add_data(rna, 'rna')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2b01a2a5-fc23-4344-898b-aa417bfae547",
   "metadata": {},
   "source": [
    "The first time data is loaded into `keeper`, it sets the observation labels that must be consistent with any other feature data set, pairwise-observation distances data and pairwise-observation similarities data that will be added to `keeper` here on out. \n",
    "\n",
    "If instead the feature data was provided as a `numpy.ndarray`, observation labels default to `'X0', 'X1', ...`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b2808dba-f8be-4248-a936-395d286c5cb4",
   "metadata": {},
   "source": [
    "Within the `netflow` environment, this data can be specified by specifying the data set's reference label or __key__, e.g., `key='rna'`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e5fd904e-37a6-4ede-b5f2-dd3d4de72d27",
   "metadata": {},
   "source": [
    "We see that `keeper` has updated the observation labels:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "3edddfbd-e14a-4433-a5e1-9c6e3199ff4b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# uncomment to print out all observation labels:\n",
    "# keeper.observation_labels"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c246c5a6-80b3-4acb-9bba-8857528f2c6e",
   "metadata": {},
   "source": [
    "We can also upload the cna data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "8f06d0a6-e69d-48d1-89b1-eb43fc1f6296",
   "metadata": {},
   "outputs": [],
   "source": [
    "keeper.add_data(cna, 'cna')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e99dc9f8-c3b5-4d08-aab3-a2ad16ba56c6",
   "metadata": {},
   "source": [
    "Similarly, we can add the graphs associated with the RNA data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "6cbee537-cb5f-405d-9896-d83d5407f579",
   "metadata": {},
   "outputs": [],
   "source": [
    "keeper.add_graph(G_rna, 'rna')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a8898b35-9add-48ae-9081-3694f32f6976",
   "metadata": {},
   "source": [
    "If we had pairwise-observation distances or similarities, we could add them to `keeper` in the same manner using the methods:\n",
    "\n",
    "- `keeper.add_distance()`\n",
    "- `keeper.add_similarity()`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "521084b4-12c7-4fb3-9369-d19081df0bd8",
   "metadata": {},
   "source": [
    "Suppose we had some miscellaneous data, for example a list of genes of interest, that we wanted to store in the `Keeper`, this could be done as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "79a9a0aa-3823-461a-b384-d0b491de3733",
   "metadata": {},
   "outputs": [],
   "source": [
    "gene_list = ['SLC3A1', 'TIMM13', 'STYX', 'MUC6', 'TIMM17B', 'TNFSF18', 'TOMM7', 'TOMM40',\n",
    "             'TRAPPC2L', 'AZU1', 'TUBB1', 'XCR1', 'VPS33B', 'XCL2', 'KNTC1']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "4e9b135f-bcd1-4835-adc2-b7e732f46a72",
   "metadata": {},
   "outputs": [],
   "source": [
    "keeper.add_misc(gene_list, 'my_gene_list')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b7a7d54c-9733-4e24-9ded-b4e3c3389ba9",
   "metadata": {},
   "source": [
    "# Load data into keeper at time of instatiation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ea40f617-6e25-44e9-ab05-fad88f93121f",
   "metadata": {},
   "source": [
    "Alternatively, you can specify any data, distances, similarities, graphs, or miscelaneous data to be loaded into the `Keeper` at the same time as it's initialized. For example, we initialize the `Keeper` and load the RNA and CNA feature data and their associated graphs in one call:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "4561208d-fdc1-4432-97c4-fd6edd593c23",
   "metadata": {},
   "outputs": [],
   "source": [
    "keeper = nf.Keeper(data={'rna': rna, 'cna': cna}, graphs={'rna': G_rna, 'cna': G_cna})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "966f5765-08e6-4615-8303-4ae9cff3b1e5",
   "metadata": {},
   "source": [
    "See the documentation on additional options for initializing and loading data into a `Keeper`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a80cfc7-a6ce-4f2e-9cf9-70b92b2fee69",
   "metadata": {},
   "source": [
    "# Load data into Keeper from file"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac5f3e64-a218-47d0-8436-f31bb14dcd03",
   "metadata": {},
   "source": [
    "Alternatively, we can load data, distances, similarities, and graphs directly from a file. \n",
    "\n",
    "Currently, file formats that can be loaded by `pandas.read_csv` are accepted. Future releases will offer additional file types."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1ce2f23e-1ea2-432d-8034-125f53c0bb5e",
   "metadata": {},
   "source": [
    "We start by initializing a `Keeper` and then load the RNA and CNA data from file. The argument `label` is used to specify how the data is referenced in the `Keeper`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "7ffd2461-5709-49dc-b0f8-0eab0d16d81e",
   "metadata": {},
   "outputs": [],
   "source": [
    "keeper = nf.Keeper(outdir=OUT_DIR)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "8fa87b90-b639-41c9-92b7-3b83acc4f59e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add RNA and CNA feature data:\n",
    "keeper.load_data(RNA_FNAME, label='rna', header=0, index_col=0) \n",
    "keeper.load_data(CNA_FNAME, label='cna', header=0, index_col=0, dtype=float)\n",
    "\n",
    "# uncomment to add methylation feature data:\n",
    "# keeper.load_data(METH_FNAME, label='meth', header=0, index_col=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "263aa444-a4b3-4b50-a932-8e3afb562d09",
   "metadata": {},
   "source": [
    "We next add the graphs associated with RNA and CNA. (See the documentation for more details and the expected edgelist format.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "690e8e4b-6b5c-4f06-9d33-4eaae69d43c5",
   "metadata": {},
   "outputs": [],
   "source": [
    "keeper.load_graph(E_RNA_FNAME, label='rna')\n",
    "keeper.load_graph(E_CNA_FNAME, label='cna')\n",
    "\n",
    "# uncomment to add methylation graph:\n",
    "# keeper.load_graph(E_METH_FNAME, label='meth')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4be7843b-3e74-4590-be6e-ab9f7b735592",
   "metadata": {},
   "source": [
    "Similarities and distances can be loaded from file into the `Keeper` in the same manner via:\n",
    "\n",
    "- `keeper.load_distance()` or `keeper.load_stacked_distance()`\n",
    "- `keeper.load_similarity()` or `keeper.load_stacked_similarity()`\n",
    "\n",
    "See the documentaion for more details."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "070f5941-f984-41cc-9cc4-a38c7e18a054",
   "metadata": {},
   "source": [
    "# Extract data from the Keeper"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a0299f1a-af22-44fa-bb6b-d6a047948074",
   "metadata": {},
   "source": [
    "We can extract the data in the form of a `pandas.DataFrame` as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "a2ea3efb-1096-42c7-8204-81455cf9162a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>TCGA-E9-A295-01</th>\n",
       "      <th>TCGA-AR-A1AS-01</th>\n",
       "      <th>TCGA-AQ-A1H2-01</th>\n",
       "      <th>TCGA-A8-A08O-01</th>\n",
       "      <th>TCGA-BH-A1FJ-01</th>\n",
       "      <th>TCGA-JL-A3YX-01</th>\n",
       "      <th>TCGA-A7-A425-01</th>\n",
       "      <th>TCGA-AC-A2BM-01</th>\n",
       "      <th>TCGA-LL-A6FP-01</th>\n",
       "      <th>TCGA-A7-A26E-01</th>\n",
       "      <th>...</th>\n",
       "      <th>TCGA-A2-A1G0-01</th>\n",
       "      <th>TCGA-WT-AB41-01</th>\n",
       "      <th>TCGA-EW-A1P6-01</th>\n",
       "      <th>TCGA-XX-A89A-01</th>\n",
       "      <th>TCGA-A7-A4SD-01</th>\n",
       "      <th>TCGA-AC-A6IX-01</th>\n",
       "      <th>TCGA-AR-A24L-01</th>\n",
       "      <th>TCGA-BH-A42U-01</th>\n",
       "      <th>TCGA-AR-A24S-01</th>\n",
       "      <th>TCGA-BH-A0BC-01</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>ABCB8</th>\n",
       "      <td>910.2982</td>\n",
       "      <td>624.0432</td>\n",
       "      <td>422.2278</td>\n",
       "      <td>659.3773</td>\n",
       "      <td>836.3215</td>\n",
       "      <td>1202.2298</td>\n",
       "      <td>1264.0832</td>\n",
       "      <td>759.0133</td>\n",
       "      <td>2768.8172</td>\n",
       "      <td>560.3929</td>\n",
       "      <td>...</td>\n",
       "      <td>840.2367</td>\n",
       "      <td>1074.9211</td>\n",
       "      <td>769.4139</td>\n",
       "      <td>922.0595</td>\n",
       "      <td>597.0149</td>\n",
       "      <td>863.0025</td>\n",
       "      <td>451.0316</td>\n",
       "      <td>795.7087</td>\n",
       "      <td>606.1712</td>\n",
       "      <td>950.1840</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>SAT1</th>\n",
       "      <td>2727.0313</td>\n",
       "      <td>2731.2022</td>\n",
       "      <td>3701.4551</td>\n",
       "      <td>2259.7105</td>\n",
       "      <td>3085.1455</td>\n",
       "      <td>3915.4537</td>\n",
       "      <td>3230.3552</td>\n",
       "      <td>2042.0907</td>\n",
       "      <td>3258.4005</td>\n",
       "      <td>4837.8020</td>\n",
       "      <td>...</td>\n",
       "      <td>1728.7968</td>\n",
       "      <td>6252.3659</td>\n",
       "      <td>2843.2332</td>\n",
       "      <td>7392.0642</td>\n",
       "      <td>3602.4876</td>\n",
       "      <td>3878.9536</td>\n",
       "      <td>2552.8859</td>\n",
       "      <td>5193.3684</td>\n",
       "      <td>5891.7026</td>\n",
       "      <td>3594.8176</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ABCF3</th>\n",
       "      <td>1310.1533</td>\n",
       "      <td>1381.3597</td>\n",
       "      <td>1413.9488</td>\n",
       "      <td>1078.8295</td>\n",
       "      <td>992.8420</td>\n",
       "      <td>1273.1496</td>\n",
       "      <td>1344.0976</td>\n",
       "      <td>1256.8570</td>\n",
       "      <td>2183.8038</td>\n",
       "      <td>1067.4277</td>\n",
       "      <td>...</td>\n",
       "      <td>1055.8843</td>\n",
       "      <td>2185.3312</td>\n",
       "      <td>1145.1250</td>\n",
       "      <td>1031.6486</td>\n",
       "      <td>1705.9701</td>\n",
       "      <td>1292.1338</td>\n",
       "      <td>924.5234</td>\n",
       "      <td>1223.4741</td>\n",
       "      <td>1003.6413</td>\n",
       "      <td>1278.0836</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>3 rows × 606 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       TCGA-E9-A295-01  TCGA-AR-A1AS-01  TCGA-AQ-A1H2-01  TCGA-A8-A08O-01  \\\n",
       "ABCB8         910.2982         624.0432         422.2278         659.3773   \n",
       "SAT1         2727.0313        2731.2022        3701.4551        2259.7105   \n",
       "ABCF3        1310.1533        1381.3597        1413.9488        1078.8295   \n",
       "\n",
       "       TCGA-BH-A1FJ-01  TCGA-JL-A3YX-01  TCGA-A7-A425-01  TCGA-AC-A2BM-01  \\\n",
       "ABCB8         836.3215        1202.2298        1264.0832         759.0133   \n",
       "SAT1         3085.1455        3915.4537        3230.3552        2042.0907   \n",
       "ABCF3         992.8420        1273.1496        1344.0976        1256.8570   \n",
       "\n",
       "       TCGA-LL-A6FP-01  TCGA-A7-A26E-01  ...  TCGA-A2-A1G0-01  \\\n",
       "ABCB8        2768.8172         560.3929  ...         840.2367   \n",
       "SAT1         3258.4005        4837.8020  ...        1728.7968   \n",
       "ABCF3        2183.8038        1067.4277  ...        1055.8843   \n",
       "\n",
       "       TCGA-WT-AB41-01  TCGA-EW-A1P6-01  TCGA-XX-A89A-01  TCGA-A7-A4SD-01  \\\n",
       "ABCB8        1074.9211         769.4139         922.0595         597.0149   \n",
       "SAT1         6252.3659        2843.2332        7392.0642        3602.4876   \n",
       "ABCF3        2185.3312        1145.1250        1031.6486        1705.9701   \n",
       "\n",
       "       TCGA-AC-A6IX-01  TCGA-AR-A24L-01  TCGA-BH-A42U-01  TCGA-AR-A24S-01  \\\n",
       "ABCB8         863.0025         451.0316         795.7087         606.1712   \n",
       "SAT1         3878.9536        2552.8859        5193.3684        5891.7026   \n",
       "ABCF3        1292.1338         924.5234        1223.4741        1003.6413   \n",
       "\n",
       "       TCGA-BH-A0BC-01  \n",
       "ABCB8         950.1840  \n",
       "SAT1         3594.8176  \n",
       "ABCF3        1278.0836  \n",
       "\n",
       "[3 rows x 606 columns]"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = keeper.data['rna'].to_frame()\n",
    "df.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f9a516e9-e9b4-4fe5-ae04-731fbc8fe587",
   "metadata": {},
   "source": [
    "Similarly, this can be done for any key in the distance or similarity keeper in the form:\n",
    "\n",
    "- `keeper.distances[key].to_frame()`\n",
    "- `keeper.similarities[key].to_frame()`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "98c383f1-5bbe-47b6-9ea6-4b51d351d2ed",
   "metadata": {},
   "source": [
    "Next, we demonstrate how to get a graph that has been stored in `keeper`, keyed by its reference label:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "6ef786bd-d337-4677-aa4a-5df003e77d6c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Graph named 'rna' with 7148 nodes and 27498 edges\n"
     ]
    }
   ],
   "source": [
    "G = keeper.graphs['rna']\n",
    "print(G)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4e974df7-deb9-46cc-9c15-e65132a927e6",
   "metadata": {},
   "source": [
    "Similarly, we can access miscellaneous data stored in `keeper`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "27e0543e-7e0e-41bb-8f33-737d2e873e04",
   "metadata": {},
   "outputs": [],
   "source": [
    "# first we add the gene list to the keeper:\n",
    "keeper.add_misc(gene_list, 'my_gene_list')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "7ccf5ce6-0e32-4d16-82f0-6e19b1d57825",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['SLC3A1',\n",
       " 'TIMM13',\n",
       " 'STYX',\n",
       " 'MUC6',\n",
       " 'TIMM17B',\n",
       " 'TNFSF18',\n",
       " 'TOMM7',\n",
       " 'TOMM40',\n",
       " 'TRAPPC2L',\n",
       " 'AZU1',\n",
       " 'TUBB1',\n",
       " 'XCR1',\n",
       " 'VPS33B',\n",
       " 'XCL2',\n",
       " 'KNTC1']"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# and to access the gene list:\n",
    "keeper.misc['my_gene_list']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "81e7ffd6-32ad-41d3-b374-78a90a3953f2",
   "metadata": {},
   "source": [
    "# Save data from the Keeper"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5b074dc9-2937-4037-a4ef-675ccf0d3b03",
   "metadata": {},
   "source": [
    "Data can be saved from:\n",
    "- `keeper.data` via `keeper.save_data()`\n",
    "- `keeper.distances` via `keeper.save_distance()`\n",
    "- `keeper.similarities` via `keeper.save_similarity()`\n",
    "- `keeper.misc` (if it is a `pandas.DataFrame`) via `keeper.save_misc()`\n",
    "\n",
    "Currently, it is saved using `pandas.to_csv`. Future releases will provide additional formats for saving data.\n",
    "\n",
    "In order to save data, `keeper` must have the attribute `outdir` defined.\n",
    "\n",
    "The data is then saved to a file named `{outdir}/{data_type}_{label}.{file_format}` where\n",
    "- `outdir` : The keeper's output directory: `keeper.outdir`.\n",
    "- `data_type` : This is one of {'data', 'distance', 'similarity', 'misc'}, depending on which method is called to save the data.\n",
    "- `label` : This is reference label for the data in the keeper that should be saved, specified by the user.\n",
    "- `file_format` This is the file extension, provided by the user (default = 'txt'). \n",
    "\n",
    "See the documentation for more details."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2f025109-eb53-4627-89ec-e1d3746ede49",
   "metadata": {},
   "source": [
    "For example, to save the RNA data to the specified output directory, you would call:\n",
    "\n",
    "`keeper.save_data('rna')`"
   ]
  },
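  {
   "cell_type": "markdown",
   "id": "d3a1f0c2-5b7e-4c19-9a44-2e6f8b1c7a01",
   "metadata": {},
   "source": [
    "The file-naming pattern described above can be sketched with plain `pandas` and `pathlib`. This is a minimal illustration under stated assumptions, not `netflow`'s implementation: the helper `build_save_path` and the toy table are hypothetical, and the default comma separator of `DataFrame.to_csv` is used only for simplicity.\n",
    "\n",
    "```python\n",
    "import tempfile\n",
    "from pathlib import Path\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "def build_save_path(outdir, data_type, label, file_format='txt'):\n",
    "    # Hypothetical helper mirroring {outdir}/{data_type}_{label}.{file_format}\n",
    "    return Path(outdir) / f'{data_type}_{label}.{file_format}'\n",
    "\n",
    "\n",
    "# Toy feature table (features as rows, observations as columns)\n",
    "toy = pd.DataFrame({'obs1': [1.0, 2.0], 'obs2': [3.0, 4.0]},\n",
    "                   index=['geneA', 'geneB'])\n",
    "\n",
    "with tempfile.TemporaryDirectory() as tmp:\n",
    "    path = build_save_path(tmp, 'data', 'rna')\n",
    "    toy.to_csv(path)\n",
    "    saved_name = path.name\n",
    "    # Read the file back the same way feature data was loaded earlier\n",
    "    reloaded = pd.read_csv(path, header=0, index_col=0)\n",
    "\n",
    "print(saved_name)\n",
    "```\n",
    "\n",
    "A temporary directory stands in for `keeper.outdir` here so that the sketch leaves no files behind."
   ]
  },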
  {
   "cell_type": "markdown",
   "id": "3c4b2b00-9abb-427c-a861-b5f1bd7d2203",
   "metadata": {},
   "source": [
    "# Extract Keeper subset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "acc4703c-2619-4acb-964a-8473c02f11b6",
   "metadata": {},
   "source": [
    "You can a new `Keeper` instance with a subset of the observations from the `keeper`. \n",
    "\n",
    "Caution should be taken for the resulting `misc` and `graphs` keepers, as they are maintained independent of the observations. You can select if the misc and graphs should be copied into the the new subset Keeper, as is.  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a0404eb6-222d-49bb-91fa-9f40209bae04",
   "metadata": {},
   "source": [
    "You can also specify the output directory `outdir` for the new keeper subset. Default is `None`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e8758aa-e2ff-4f19-9c56-263aed2337da",
   "metadata": {},
   "source": [
    "For example, we can extract a subset of `keeper` with the first 3 observations, using `keep_misc=True` and `keep_graphs=True` to keep a copy of miscellaneous data and the graphs. Additionally, we use `outdir` to specify the output directory we want for this subset of observations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "54c86564-a8d3-48b5-8539-65c9dae2ecc2",
   "metadata": {},
   "outputs": [],
   "source": [
    "outdir_sub = OUT_DIR.parent / (OUT_DIR.name + '_sub')\n",
    "\n",
    "keeper_sub = keeper.subset(keeper.observation_labels[:10], \n",
    "                           keep_misc=True, keep_graphs=True,\n",
    "                           outdir=outdir_sub)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "50a3f266-e1cd-4363-955c-81973d0394f1",
   "metadata": {},
   "source": [
    "# Extract data subset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ebdf1642-a9a2-41b6-be62-75b530db3f2f",
   "metadata": {},
   "source": [
    "You can extract a subset of observations and features for any data stored in the data-keeper as a `pandas.DataFrame`: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "c0572c60-8bcb-4259-8c78-69dac3ec4b7d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>TCGA-E9-A295-01</th>\n",
       "      <th>TCGA-AR-A1AS-01</th>\n",
       "      <th>TCGA-AQ-A1H2-01</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>SLC3A1</th>\n",
       "      <td>4.8292</td>\n",
       "      <td>0.0000</td>\n",
       "      <td>3.1360</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TIMM13</th>\n",
       "      <td>1078.3533</td>\n",
       "      <td>992.7960</td>\n",
       "      <td>708.2288</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>STYX</th>\n",
       "      <td>829.6511</td>\n",
       "      <td>542.0982</td>\n",
       "      <td>403.1611</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>MUC6</th>\n",
       "      <td>1.4488</td>\n",
       "      <td>130.1216</td>\n",
       "      <td>0.0000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TIMM17B</th>\n",
       "      <td>827.7194</td>\n",
       "      <td>573.1652</td>\n",
       "      <td>739.5886</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         TCGA-E9-A295-01  TCGA-AR-A1AS-01  TCGA-AQ-A1H2-01\n",
       "SLC3A1            4.8292           0.0000           3.1360\n",
       "TIMM13         1078.3533         992.7960         708.2288\n",
       "STYX            829.6511         542.0982         403.1611\n",
       "MUC6              1.4488         130.1216           0.0000\n",
       "TIMM17B         827.7194         573.1652         739.5886"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rna_sub = keeper.data['rna'].subset(observations=keeper.observation_labels[:3],\n",
    "                                    features=gene_list[:5])\n",
    "rna_sub"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0275c9f0-b067-45b1-aaad-637e414c43ab",
   "metadata": {},
   "source": [
    "Similarly, you can extract distances (or similarities) between a subset of observations as a `pandas.DataFrame`. We'll first add a distance to the keeper to demonstrate this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "6015db82-ec4c-4a62-a194-cd691c772d08",
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.spatial.distance import cdist"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "e8741860-5716-4999-90dc-8f1964ec4c46",
   "metadata": {},
   "outputs": [],
   "source": [
    "rna_euc = pd.DataFrame(data=cdist(rna.T, rna.T, metric='euclidean'),\n",
    "                       index=rna.columns.copy(), columns=rna.columns.copy())\n",
    "keeper.add_distance(rna_euc, 'rna_euc')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "4e5acca6-63c7-4094-81cb-5ba27cd4edce",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>TCGA-E9-A295-01</th>\n",
       "      <th>TCGA-AR-A1AS-01</th>\n",
       "      <th>TCGA-AQ-A1H2-01</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>TCGA-E9-A295-01</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>351149.925916</td>\n",
       "      <td>392258.653370</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-AR-A1AS-01</th>\n",
       "      <td>351149.925916</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>382107.888105</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-AQ-A1H2-01</th>\n",
       "      <td>392258.653370</td>\n",
       "      <td>382107.888105</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 TCGA-E9-A295-01  TCGA-AR-A1AS-01  TCGA-AQ-A1H2-01\n",
       "TCGA-E9-A295-01         0.000000    351149.925916    392258.653370\n",
       "TCGA-AR-A1AS-01    351149.925916         0.000000    382107.888105\n",
       "TCGA-AQ-A1H2-01    392258.653370    382107.888105         0.000000"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "keeper.distances['rna_euc'].subset(keeper.observation_labels[:3])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c1d7957c-5e47-4b5b-a872-e98ed44deb2a",
   "metadata": {},
   "source": [
    "The same can be done for a similarity stored in `keeper` using `keeper.similarities` in place of `keeper.distances`."
   ]
  },
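  {
   "cell_type": "markdown",
   "id": "7b2e9c14-0f3d-4a68-b1d5-9c4e6a2f8b02",
   "metadata": {},
   "source": [
    "For intuition, the distance computation and observation subsetting above can be reproduced on a toy table using only `pandas` and `scipy`. This is a sketch under stated assumptions: the toy table and its labels are made up, `pdist` with `squareform` is swapped in for `cdist`, and label-based selection with `.loc` stands in for the keeper's `subset` method rather than reproducing how `netflow` implements it.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "from scipy.spatial.distance import pdist, squareform\n",
    "\n",
    "# Toy feature table: rows are features, columns are observations\n",
    "toy = pd.DataFrame({'obs1': [0.0, 0.0], 'obs2': [3.0, 4.0], 'obs3': [6.0, 8.0]},\n",
    "                   index=['geneA', 'geneB'])\n",
    "\n",
    "# Pairwise Euclidean distances between observations (the columns)\n",
    "toy_euc = pd.DataFrame(squareform(pdist(toy.T.values, metric='euclidean')),\n",
    "                       index=toy.columns.copy(), columns=toy.columns.copy())\n",
    "\n",
    "# Restrict both rows and columns to the same subset of observations\n",
    "keep = ['obs1', 'obs3']\n",
    "toy_euc_sub = toy_euc.loc[keep, keep]\n",
    "print(toy_euc_sub)\n",
    "```"
   ]
  },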
  {
   "cell_type": "markdown",
   "id": "812e6da6-c8f9-43f0-a51e-8a673e13bebc",
   "metadata": {},
   "source": [
    "# Iterating through feature data, distances or similarities"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "001757b6-2ef3-4a24-9052-c329007b3623",
   "metadata": {},
   "source": [
    "Before demonstrating how to iterate through the data, we review some of the `Keeper` class properties in a bit more detail: \n",
    "\n",
    "When we instatiate a `Keeper` object: `keeper = netflow.Keeper()`, \n",
    "the `keeper` is initialized with an instance of the `DataKeeper` class assigned as `keeper.data` and two instances of the `DistanceKeeper` class assigned as `keeper.distances` and `keeper.similarities`. Data is added to the `DataKeeper` and `DistanceKeeper` with a reference key which may then be accessed in the same manner as retrieving a value from a `dict`. However, the `DataKeeper` and `DistanceKeeper` maintain a bit more information than just the data itself to regulate observation (and feature) properties. Therefore, keyed-accessing data from `DataKeeper` or `DistanceKeeper` returns an instance of `DataView` or `DistanceView`, respectively (instead of the original input data). \n",
    "\n",
    "For example: `x = keeper.data['rna']`   \n",
    "Here, `x` is an instance of the `DataView` class.   \n",
    "The RNA data itself (as a `numpy.ndarray`) can be accessed via `x.data`.    \n",
    "And previously, we demonstrated how to extract the data as a `pandas.DataFrame`: `x.to_frame()`. (The `.to_frame()` property is actually a method of the `DataView` (and `DistanceView`) class.)\n",
    "\n",
    "You can iterate over the data stored in a `DataKeeper` and `DistanceKeeper`, which yields an instance of `DataView` and `DistanceView`, repectively, at each iteration.\n",
    "\n",
    "Therefore, you can iterate over the feature data, distances and similarities. The process is the same for each type of store data, so we demonstrate this on the feature data."
   ]
  },
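  {
   "cell_type": "markdown",
   "id": "e94c6a37-2d18-4f5b-8a70-1b3c5d7e9f03",
   "metadata": {},
   "source": [
    "To make the keeper-and-view pattern described above concrete, here is a toy sketch. The classes `ToyView` and `ToyKeeper` are hypothetical and are not `netflow`'s `DataView` or `DataKeeper`; they only illustrate a container that returns a view on keyed access, yields views on iteration, and supports key membership checks.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "class ToyView:\n",
    "    # Hypothetical stand-in for a view: bundles an array with its labels\n",
    "    def __init__(self, label, data, feature_labels, observation_labels):\n",
    "        self.label = label\n",
    "        self.data = np.asarray(data)\n",
    "        self.feature_labels = list(feature_labels)\n",
    "        self.observation_labels = list(observation_labels)\n",
    "\n",
    "    @property\n",
    "    def num_features(self):\n",
    "        return self.data.shape[0]\n",
    "\n",
    "    def to_frame(self):\n",
    "        return pd.DataFrame(self.data, index=self.feature_labels,\n",
    "                            columns=self.observation_labels)\n",
    "\n",
    "\n",
    "class ToyKeeper:\n",
    "    # Hypothetical stand-in for a keeper: stores views by reference key\n",
    "    def __init__(self):\n",
    "        self._views = {}\n",
    "\n",
    "    def add(self, frame, label):\n",
    "        self._views[label] = ToyView(label, frame.values, frame.index, frame.columns)\n",
    "\n",
    "    def __getitem__(self, key):\n",
    "        return self._views[key]\n",
    "\n",
    "    def __iter__(self):\n",
    "        # Iteration yields the views themselves, not the keys\n",
    "        return iter(self._views.values())\n",
    "\n",
    "    def __contains__(self, key):\n",
    "        return key in self._views\n",
    "\n",
    "    def keys(self):\n",
    "        return self._views.keys()\n",
    "\n",
    "\n",
    "toy_keeper = ToyKeeper()\n",
    "toy_keeper.add(pd.DataFrame([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],\n",
    "                            index=['g1', 'g2', 'g3'], columns=['obs1', 'obs2']), 'rna')\n",
    "toy_keeper.add(pd.DataFrame([[2.0, 1.0]], index=['g1'], columns=['obs1', 'obs2']), 'cna')\n",
    "\n",
    "labels = [view.label for view in toy_keeper]\n",
    "counts = [view.num_features for view in toy_keeper]\n",
    "print(labels, counts)\n",
    "```\n",
    "\n",
    "Unlike a plain `dict`, whose iteration yields keys, this toy container yields the stored views, which is the behaviour the tutorial relies on when looping over `keeper.data`."
   ]
  },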
  {
   "cell_type": "markdown",
   "id": "ea16b1b6-96f0-491e-aac8-3b246641e3c2",
   "metadata": {},
   "source": [
    "Here we iterate through all the feature data and print the keyed-label:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "2f983bc5-bf6d-4b10-a28c-941bdaf95555",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "rna\n",
      "cna\n"
     ]
    }
   ],
   "source": [
    "for dd in keeper.data:\n",
    "    print(dd.label)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "adf336a9-e8a5-4303-868e-c57b0f6fe2e8",
   "metadata": {},
   "source": [
    "Here we iterate through all the feature data and print out the number of features:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "ed462bf2-2c2f-43c1-9b8f-b0a40694d133",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "7148\n",
      "8763\n"
     ]
    }
   ],
   "source": [
    "for dd in keeper.data:\n",
    "    print(dd.num_features)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eb4024be-baa7-4d1d-ac12-ad4da4063350",
   "metadata": {},
   "source": [
    "Here we iterate through all the feature data and print out the first 4 feature labels:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "804ef414-5812-4f36-a29b-e8e39fc6a450",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['ABCB8', 'SAT1', 'ABCF3', 'ARF1']\n",
      "['ABCB8', 'SAT1', 'ABCF3', 'ARF1']\n"
     ]
    }
   ],
   "source": [
    "for dd in keeper.data:\n",
    "    print(dd.feature_labels[:4])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a6392e1c-69e0-4430-ad37-f122db11c884",
   "metadata": {},
   "source": [
    "Here we iterate through all the feature data and print out the stored data for the first 3 observations and first 4 features:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "id": "236c58fe-0115-4175-8176-d280ef2dabe2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[  910.2982   624.0432   422.2278]\n",
      " [ 2727.0313  2731.2022  3701.4551]\n",
      " [ 1310.1533  1381.3597  1413.9488]\n",
      " [15602.5595 17668.1675 10149.0216]]\n",
      "[[2. 2. 1.]\n",
      " [2. 2. 1.]\n",
      " [2. 3. 2.]\n",
      " [3. 3. 2.]]\n"
     ]
    }
   ],
   "source": [
    "for dd in keeper.data:\n",
    "    print(dd.data[:4, :3])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "92e71373-0acf-4435-a16c-b4d55b1e426c",
   "metadata": {},
   "source": [
    "You can also use the `key` attribute to see the labels of data stored in any of the keepers:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "295cca36-756b-44fe-81fa-993ab14ce417",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict_keys(['rna', 'cna'])"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "keeper.data.keys()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "id": "071fdc62-6913-48db-a2fb-6006284bc07b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict_keys(['rna_euc'])"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "keeper.distances.keys()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "id": "4c2a18a6-69c9-412d-993d-9a5500fab11f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict_keys(['rna', 'cna'])"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "keeper.graphs.keys()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c5200047-9b87-467e-9483-2adcd0f35c27",
   "metadata": {},
   "source": [
    "# Membership"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5b3b96c8-9f68-456a-b952-11aede03f789",
   "metadata": {},
   "source": [
    "You can check if data is in `keeper.data`, `keeper.distances`, or `keeper.similarities` by its key.\n",
    "\n",
    "For example, we next check if the RNA feature data is in `keeper.data`You can check if data is in `keeper.data`, `keeper.distances`, or `keeper.similarities` by its key.\n",
    "\n",
    "For example, we next check if the RNA feature data is in `keeper.data`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "9dbe2643-9864-4e9f-973d-ec0a147a7996",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "'rna' in keeper.data"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:geo_env_test] *",
   "language": "python",
   "name": "conda-env-geo_env_test-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}