{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# MuData quickstart\n", "\n", "This notebooks provides an introduction to multimodal data objects." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gtca/muon/blob/master/docs/source/notebooks/quickstart_mudata.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`muon` is a framework for multimodal data analysis with a strong focus on multi-omics. \n", "\n", "With its multimodal objects built on top of [AnnData](https://anndata.readthedocs.io/en/latest/index.html) and state-of-the-art integration methods built in, `muon` fits naturally into the rich Python ecosystem for data analysis, and its modular design for the analysis of the individual omics assays provides necessary functionality for the common workflows out of the box." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import muon as mu\n", "from muon import MuData" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multimodal objects" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see how multimodal objects behave, we will simulate some data first:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1000, 100)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "np.random.seed(1)\n", "\n", "n, d, k = 1000, 100, 10\n", "\n", "z = np.random.normal(loc=np.arange(k), scale=np.arange(k) * 2, size=(n, k))\n", "w = np.random.normal(size=(d, k))\n", "y = np.dot(z, w.T)\n", "y.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creating an `AnnData` object from the matrix will allow us to add annotations to its different dimensions (_\"observations\"_, e.g. samples, and measured _\"variables\"_):" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "AnnData object with n_obs × n_vars = 1000 × 100" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from anndata import AnnData\n", "\n", "adata = AnnData(y)\n", "adata.obs_names = [f\"obs_{i+1}\" for i in range(n)]\n", "adata.var_names = [f\"var_{j+1}\" for j in range(d)]\n", "adata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will go ahead and create a second object with data for the _same observations_ but for _different variables_:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "AnnData object with n_obs × n_vars = 1000 × 50" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d2 = 50\n", "w2 = np.random.normal(size=(d2, k))\n", "y2 = np.dot(z, w2.T)\n", "\n", "adata2 = AnnData(y2)\n", "adata2.obs_names = [f\"obs_{i+1}\" for i in range(n)]\n", "adata2.var_names = [f\"var2_{j+1}\" for j in range(d2)]\n", "adata2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now wrap these two objects into a `MuData` object:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
MuData object with n_obs × n_vars = 1000 × 150\n",
       "  2 modalities\n",
       "    A:\t1000 x 100\n",
       "    B:\t1000 x 50
" ], "text/plain": [ "MuData object with n_obs × n_vars = 1000 × 150\n", " 2 modalities\n", " A:\t1000 x 100\n", " B:\t1000 x 50" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mdata = MuData({\"A\": adata, \"B\": adata2})\n", "mdata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_Observations_ and _variables_ of the `MuData` object are global, which means that observations with the identical name (`.obs_names`) in different modalities are considered to be the same observation. This also means variable names (`.var_names`) should be unique.\n", "\n", "This is reflected in the object description above: `mdata` has 1000 _observations_ and 150=100+50 _variables_." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Variable mappings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Upon construction of a `MuData` object, a global binary mapping between _observations_ and individual modalities is created as well as between _variables_ and modalities.\n", "\n", "Since all the observations are the same across modalities in `mdata`, all the values in the _observations_ mappings are set to `True`:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.sum(mdata.obsm[\"A\"]) == np.sum(mdata.obsm[\"B\"]) == n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For variables, those are 150-long vectors, e.g. for the `A` modality — with 100 `True` values followed by 50 `False` values:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ True, True, True, True, True, True, True, True, True,\n", " True, True, True, True, True, True, True, True, True,\n", " True, True, True, True, True, True, True, True, True,\n", " True, True, True, True, True, True, True, True, True,\n", " True, True, True, True, True, True, True, True, True,\n", " True, True, True, True, True, True, True, True, True,\n", " True, True, True, True, True, True, True, True, True,\n", " True, True, True, True, True, True, True, True, True,\n", " True, True, True, True, True, True, True, True, True,\n", " True, True, True, True, True, True, True, True, True,\n", " True, True, True, True, True, True, True, True, True,\n", " True, False, False, False, False, False, False, False, False,\n", " False, False, False, False, False, False, False, False, False,\n", " False, False, False, False, False, False, False, False, False,\n", " False, False, False, False, False, False, False, False, False,\n", " False, False, False, False, False, False, False, False, False,\n", " False, False, False, False, False, False])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mdata.varm[\"A\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Object references" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Importantly, individual modalities are stored as references to the original objects." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "AnnData object with n_obs × n_vars = 1000 × 54" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Only keep variables with value > 1 in obs_1\n", "# with in-place filtering for the variables\n", "mu.pp.filter_var(adata, adata[\"obs_1\", :].X.flatten() > 1)\n", "adata" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "AnnData object with n_obs × n_vars = 1000 × 54" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Modalities can be accessed within the .mod attributes\n", "mdata.mod[\"A\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is also why the `MuData` object has to be updated in order to reflect the latest changes to the modalities it includes:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Outdated size: 100\n", "Updated size: 54\n" ] } ], "source": [ "print(f\"Outdated size:\", mdata.varm[\"A\"].sum())\n", "mdata.update()\n", "print(f\"Updated size:\", mdata.varm[\"A\"].sum())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Common observations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While `mdata` is comprised of the same observations for both modalities, it is not always the case in the real world where some data might be missing. By design, `muon` accounts for these scenarios since there's no guarantee observations are the same — or even intersecting — for a `MuData` instance." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# Throw away the last sample in the modality 'B'\n", "# with in-place filtering for the observations\n", "mu.pp.filter_obs(mdata.mod[\"B\"], [True for _ in range(n - 1)] + [False])" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
MuData object with n_obs × n_vars = 1000 × 104\n",
       "  2 modalities\n",
       "    A:\t1000 x 54\n",
       "    B:\t999 x 50
" ], "text/plain": [ "MuData object with n_obs × n_vars = 1000 × 104\n", " 2 modalities\n", " A:\t1000 x 54\n", " B:\t999 x 50" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# adata2 object has also changed\n", "assert mdata.mod[\"B\"].shape == adata2.shape\n", "\n", "mdata.update()\n", "mdata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`muon` provides, however, a simple function to drop the observations that are not present in all the modalities:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
MuData object with n_obs × n_vars = 999 × 104\n",
       "  2 modalities\n",
       "    A:\t999 x 54\n",
       "    B:\t999 x 50
" ], "text/plain": [ "MuData object with n_obs × n_vars = 999 × 104\n", " 2 modalities\n", " A:\t999 x 54\n", " B:\t999 x 50" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mu.pp.intersect_obs(mdata)\n", "mdata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Rich representation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some notebook environments such as Jupyter/IPython allow for the [rich object representation](https://ipython.readthedocs.io/en/stable/config/integrating.html). This is what `muon` uses in order to provide an optional HTML representation that allows to interactively explore `MuData` objects. While the dataset in our example is not the most comprehensive one, here is how it looks like:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "MuData object 999 obs × 104 var in 2 modalities
Metadata
.obs0 elements
No metadata
Embeddings & mappings
.obsm2 elements
\n", " \n", " \n", "\n", " \n", "
A bool numpy.ndarray
B bool numpy.ndarray
Distances
.obsp0 elements
No distances
A
999 × 54
AnnData object 999 obs × 54 var
Matrix
.X
\n", " float32    numpy.ndarray\n", "
Layers
.layers0 elements
No layers
Metadata
.obs0 elements
No metadata
Embeddings
.obsm0 elements
No embeddings
Distances
.obsp0 elements
No distances
Miscellaneous
.uns0 elements
No miscellaneous
B
999 × 50
AnnData object 999 obs × 50 var
Matrix
.X
\n", " float32    numpy.ndarray\n", "
Layers
.layers0 elements
No layers
Metadata
.obs0 elements
No metadata
Embeddings
.obsm0 elements
No embeddings
Distances
.obsp0 elements
No distances
Miscellaneous
.uns0 elements
No miscellaneous

" ], "text/plain": [ "MuData object with n_obs × n_vars = 999 × 104\n", " 2 modalities\n", " A:\t999 x 54\n", " B:\t999 x 50" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "with mu.set_options(display_style=\"html\", display_html_expand=0b000):\n", " display(mdata)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Running `mu.set_options(display_style = \"html\")` will change the setting for the current Python session. \n", "\n", "The flag `display_html_expand` has three bits that correspond to (1) `MuData` attributes, (2) modalities, (3) `AnnData` attributes, and indicates if the fields should be expanded by default (`1`) or collapsed under the `` tag (`0`)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### .h5mu files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`MuData` objects were designed to be serialized into `.h5mu` files. Modalities are stored under their respective names in the `/mod` HDF5 group of the `.h5mu` file. Each individual modality, e.g. `/mod/A`, is stored in the same way as it would be stored in the `.h5ad` file." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
MuData object with n_obs × n_vars = 999 × 104 backed at '/var/folders/xt/tvy3s7w17vn1b700k_351pz00000gp/T/muon_getting_started_4b8mn4v8.h5mu'\n",
       "  2 modalities\n",
       "    A:\t999 x 54\n",
       "    B:\t999 x 50
" ], "text/plain": [ "MuData object with n_obs × n_vars = 999 × 104 backed at '/var/folders/xt/tvy3s7w17vn1b700k_351pz00000gp/T/muon_getting_started_4b8mn4v8.h5mu'\n", " 2 modalities\n", " A:\t999 x 54\n", " B:\t999 x 50" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import tempfile\n", "\n", "# Create a temporary file\n", "temp_file = tempfile.NamedTemporaryFile(mode=\"w\", suffix=\".h5mu\", prefix=\"muon_getting_started_\")\n", "\n", "mdata.write(temp_file.name)\n", "mdata_r = mu.read(temp_file.name, backed=True)\n", "mdata_r" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Individual modalities are backed as well — inside the `.h5mu` file:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mdata_r[\"A\"].isbacked" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The rich representation would also reflect the _backed_ state of `MuData` objects when they are loaded from `.h5mu` files in the read-only mode and would point to the respective file:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "MuData object 999 obs × 104 var in 2 modalities
backed at /var/folders/xt/tvy3s7w17vn1b700k_351pz00000gp/T/muon_getting_started_4b8mn4v8.h5mu
Metadata
.obs0 elements
No metadata
Embeddings & mappings
.obsm2 elements
\n", " \n", " \n", "\n", " \n", "
A bool numpy.ndarray
B bool numpy.ndarray
Distances
.obsp0 elements
No distances
A
999 × 54
AnnData object 999 obs × 54 var
backed at /var/folders/xt/tvy3s7w17vn1b700k_351pz00000gp/T/muon_getting_started_4b8mn4v8.h5mu
Matrix
.X
\n", " float32    h5py._hl.dataset.Dataset\n", "
Layers
.layers0 elements
No layers
Metadata
.obs0 elements
No metadata
Embeddings
.obsm0 elements
No embeddings
Distances
.obsp0 elements
No distances
Miscellaneous
.uns0 elements
No miscellaneous
B
999 × 50
AnnData object 999 obs × 50 var
backed at /var/folders/xt/tvy3s7w17vn1b700k_351pz00000gp/T/muon_getting_started_4b8mn4v8.h5mu
Matrix
.X
\n", " float32    h5py._hl.dataset.Dataset\n", "
Layers
.layers0 elements
No layers
Metadata
.obs0 elements
No metadata
Embeddings
.obsm0 elements
No embeddings
Distances
.obsp0 elements
No distances
Miscellaneous
.uns0 elements
No miscellaneous

" ], "text/plain": [ "MuData object with n_obs × n_vars = 999 × 104 backed at '/var/folders/xt/tvy3s7w17vn1b700k_351pz00000gp/T/muon_getting_started_4b8mn4v8.h5mu'\n", " 2 modalities\n", " A:\t999 x 54\n", " B:\t999 x 50" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "with mu.set_options(display_style=\"html\", display_html_expand=0b000):\n", " display(mdata_r)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multimodal methods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When the `MuData` object is prepared, it is up to multimodal methods to be used to make sense of the data. The most simple and naïve approach is to concatenate matrices from multiple modalities to perform e.g. dimensionality reduction." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(999, 104)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = np.hstack([mdata.mod[\"A\"].X, mdata.mod[\"B\"].X])\n", "x.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can write a simple function to run principal component analysis on such a concatenated matrix. `MuData` object provides a place to store multimodal embeddings — `MuData.obsm`. It is similar to how the embeddings generated on invidual modalities are stored, only this time it is saved inside the `MuData` object rather than in `AnnData.obsm`." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "def simple_pca(mdata):\n", " from sklearn import decomposition\n", "\n", " x = np.hstack([m.X for m in mdata.mod.values()])\n", "\n", " pca = decomposition.PCA(n_components=2)\n", " components = pca.fit_transform(x)\n", "\n", " # By default, methods operate in-place\n", " # and embeddings are stored in the .obsm slot\n", " mdata.obsm[\"X_pca\"] = components\n", "\n", " return" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MuData object with n_obs × n_vars = 999 × 104\n", " obsm:\t'X_pca'\n", " 2 modalities\n", " A:\t999 x 54\n", " B:\t999 x 50\n" ] } ], "source": [ "simple_pca(mdata)\n", "print(mdata)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In reality, however, having different modalities often means that the features between them come from different generative processes and are not comparable.\n", "\n", "This is where special multimodal integration methods come into play. For omics technologies, these methods are frequently addressed as _multi-omics integration methods_. Such methods are included in `muon` out of the box, and `MuData` objects make it easy for the new methods to be easily applied to such data.\n", "\n", "More details on the multi-omics methods [are provided in the documentation here](https://muon.readthedocs.io/en/latest/omics/multi.html)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.6" } }, "nbformat": 4, "nbformat_minor": 4 }