MuData quickstart#
This notebook provides an introduction to multimodal data objects.
muon is a framework for multimodal data analysis with a strong focus on multi-omics.
With its multimodal objects built on top of AnnData and state-of-the-art integration methods built in, muon fits naturally into the rich Python ecosystem for data analysis, and its modular design for the analysis of individual omics assays provides the necessary functionality for common workflows out of the box.
[1]:
import muon as mu
from muon import MuData
Multimodal objects#
To see how multimodal objects behave, we will simulate some data first:
[2]:
import numpy as np
np.random.seed(1)
n, d, k = 1000, 100, 10
z = np.random.normal(loc=np.arange(k), scale=np.arange(k) * 2, size=(n, k))
w = np.random.normal(size=(d, k))
y = np.dot(z, w.T)
y.shape
[2]:
(1000, 100)
Creating an AnnData object from the matrix will allow us to add annotations to its different dimensions (“observations”, e.g. samples, and measured “variables”):
[3]:
from anndata import AnnData
adata = AnnData(y)
adata.obs_names = [f"obs_{i+1}" for i in range(n)]
adata.var_names = [f"var_{j+1}" for j in range(d)]
adata
[3]:
AnnData object with n_obs × n_vars = 1000 × 100
We will go ahead and create a second object with data for the same observations but for different variables:
[4]:
d2 = 50
w2 = np.random.normal(size=(d2, k))
y2 = np.dot(z, w2.T)
adata2 = AnnData(y2)
adata2.obs_names = [f"obs_{i+1}" for i in range(n)]
adata2.var_names = [f"var2_{j+1}" for j in range(d2)]
adata2
[4]:
AnnData object with n_obs × n_vars = 1000 × 50
We can now wrap these two objects into a MuData object:
[5]:
mdata = MuData({"A": adata, "B": adata2})
mdata
[5]:
MuData object with n_obs × n_vars = 1000 × 150
  2 modalities
  A: 1000 x 100
  B: 1000 x 50
Observations and variables of the MuData object are global: observations with identical names (.obs_names) in different modalities are considered to be the same observation. This also means variable names (.var_names) should be unique.
This is reflected in the object description above: mdata has 1000 observations and 150 = 100 + 50 variables.
Variable mappings#
Upon construction of a MuData object, a global binary mapping is created between observations and individual modalities, as well as between variables and modalities.
Since all the observations are the same across modalities in mdata, all the values in the observation mappings are set to True:
[6]:
np.sum(mdata.obsm["A"]) == np.sum(mdata.obsm["B"]) == n
[6]:
True
For variables, those are 150-long vectors, e.g. for the A modality, with 100 True values followed by 50 False values:
[7]:
mdata.varm["A"]
[7]:
array([ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False])
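Such a boolean mask can be used to map global variables back to a modality. A minimal NumPy sketch of this selection, rebuilding the mask and the global variable names by hand rather than reading them from mdata:

```python
import numpy as np

# Hand-built stand-in for mdata.varm["A"]: 100 True values then 50 False values
mask_a = np.concatenate([np.ones(100, dtype=bool), np.zeros(50, dtype=bool)])

# Global variable names in the same order as the modalities were concatenated
var_names = np.array(
    [f"var_{j+1}" for j in range(100)] + [f"var2_{j+1}" for j in range(50)]
)

# Boolean indexing recovers exactly the variables that belong to modality "A"
a_vars = var_names[mask_a]
print(len(a_vars), a_vars[0], a_vars[-1])  # 100 var_1 var_100
```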
Object references#
Importantly, individual modalities are stored as references to the original objects.
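The consequence of this reference semantics can be sketched with plain Python (a toy stand-in class, not the real AnnData): mutating the original object is visible through the container that holds it.

```python
class Modality:
    """Toy stand-in for an AnnData object (illustrative only)."""
    def __init__(self, n_vars):
        self.n_vars = n_vars

adata_toy = Modality(100)
container = {"A": adata_toy}   # a MuData-like container keeps a reference

adata_toy.n_vars = 54          # in-place change to the original object

print(container["A"] is adata_toy)   # True: same object, not a copy
print(container["A"].n_vars)         # 54: the change is visible here too
```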
[8]:
# Only keep variables with value > 1 in obs_1
# with in-place filtering for the variables
mu.pp.filter_var(adata, adata["obs_1", :].X.flatten() > 1)
adata
[8]:
AnnData object with n_obs × n_vars = 1000 × 54
[9]:
# Modalities can be accessed within the .mod attributes
mdata.mod["A"]
[9]:
AnnData object with n_obs × n_vars = 1000 × 54
This is also why the MuData object has to be updated in order to reflect the latest changes to the modalities it includes:
[10]:
print("Outdated size:", mdata.varm["A"].sum())
mdata.update()
print("Updated size:", mdata.varm["A"].sum())
Outdated size: 100
Updated size: 54
Common observations#
While mdata comprises the same observations for both modalities, that is not always the case in the real world, where some data might be missing. By design, muon accounts for these scenarios, since there is no guarantee that observations are the same, or even overlapping, across the modalities of a MuData instance.
[11]:
# Throw away the last sample in the modality 'B'
# with in-place filtering for the observations
mu.pp.filter_obs(mdata.mod["B"], [True for _ in range(n - 1)] + [False])
[12]:
# adata2 object has also changed
assert mdata.mod["B"].shape == adata2.shape
mdata.update()
mdata
[12]:
MuData object with n_obs × n_vars = 1000 × 104
  2 modalities
  A: 1000 x 54
  B: 999 x 50
However, muon provides a simple function to drop the observations that are not present in all modalities:
[13]:
mu.pp.intersect_obs(mdata)
mdata
[13]:
MuData object with n_obs × n_vars = 999 × 104
  2 modalities
  A: 999 x 54
  B: 999 x 50
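What such an intersection step does can be sketched with plain Python sets (toy observation names mirroring the example above, not the real implementation):

```python
# Toy observation names mirroring the example above
obs_a = [f"obs_{i+1}" for i in range(1000)]   # modality "A"
obs_b = [f"obs_{i+1}" for i in range(999)]    # modality "B", last sample dropped

# Keep only observations present in every modality, preserving order
common = set(obs_a) & set(obs_b)
shared = [o for o in obs_a if o in common]

print(len(shared))   # 999
print(shared[-1])    # obs_999
```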
Rich representation#
Some notebook environments such as Jupyter/IPython allow for rich object representations. muon uses this to provide an optional HTML representation that allows MuData objects to be explored interactively. While the dataset in our example is not the most comprehensive one, here is how it looks:
[14]:
with mu.set_options(display_style="html", display_html_expand=0b000):
display(mdata)
[Interactive HTML representation of mdata: collapsible sections for .obs, .obsm (the A and B boolean mappings), .obsp, and each modality]
Running mu.set_options(display_style="html")
will change the setting for the current Python session.
The flag display_html_expand has three bits that correspond to (1) MuData attributes, (2) modalities, and (3) AnnData attributes, and indicates whether the fields should be expanded by default (1) or collapsed under the <summary> tag (0).
.h5mu files#
MuData objects were designed to be serialized into .h5mu files. Modalities are stored under their respective names in the /mod HDF5 group of the .h5mu file. Each individual modality, e.g. /mod/A, is stored in the same way as it would be stored in an .h5ad file.
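The resulting on-disk layout can be sketched as follows (an illustrative tree with a hypothetical file name; attribute names are abbreviated):

```
example.h5mu
├── mod
│   ├── A          # stored like an .h5ad file: X, obs, var, obsm, …
│   └── B
├── obs            # global observation annotations
├── obsm           # global mappings and embeddings
└── var            # global variable annotations
```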
[15]:
import tempfile
# Create a temporary file
temp_file = tempfile.NamedTemporaryFile(mode="w", suffix=".h5mu", prefix="muon_getting_started_")
mdata.write(temp_file.name)
mdata_r = mu.read(temp_file.name, backed=True)
mdata_r
[15]:
MuData object with n_obs × n_vars = 999 × 104 backed at '/var/folders/xt/tvy3s7w17vn1b700k_351pz00000gp/T/muon_getting_started_4b8mn4v8.h5mu'
  2 modalities
  A: 999 x 54
  B: 999 x 50
Individual modalities are backed as well, inside the same .h5mu file:
[16]:
mdata_r["A"].isbacked
[16]:
True
The rich representation also reflects the backed state of MuData objects when they are loaded from .h5mu files in read-only mode, and points to the respective file:
[17]:
with mu.set_options(display_style="html", display_html_expand=0b000):
display(mdata_r)
[Interactive HTML representation of mdata_r, with the object and each modality annotated as backed at the .h5mu file]
Multimodal methods#
When the MuData object is prepared, multimodal methods can be used to make sense of the data. The simplest and most naïve approach is to concatenate matrices from multiple modalities in order to perform e.g. dimensionality reduction.
[18]:
x = np.hstack([mdata.mod["A"].X, mdata.mod["B"].X])
x.shape
[18]:
(999, 104)
We can write a simple function to run principal component analysis on such a concatenated matrix. The MuData object provides a place to store multimodal embeddings: MuData.obsm. It is similar to how embeddings generated on individual modalities are stored, only this time they are saved inside the MuData object rather than in AnnData.obsm.
[19]:
def simple_pca(mdata):
from sklearn import decomposition
x = np.hstack([m.X for m in mdata.mod.values()])
pca = decomposition.PCA(n_components=2)
components = pca.fit_transform(x)
# By default, methods operate in-place
# and embeddings are stored in the .obsm slot
mdata.obsm["X_pca"] = components
return
[20]:
simple_pca(mdata)
print(mdata)
MuData object with n_obs × n_vars = 999 × 104
obsm: 'X_pca'
2 modalities
A: 999 x 54
B: 999 x 50
In reality, however, having different modalities often means that their features come from different generative processes and are not directly comparable.
This is where dedicated multimodal integration methods come into play. For omics technologies, these are frequently referred to as multi-omics integration methods. Such methods are included in muon out of the box, and MuData objects make it easy to apply new methods to such data.
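One simple illustration of why raw concatenation can mislead: modalities measured on very different scales should be standardized separately first. A minimal NumPy sketch with toy matrices (not part of the tutorial data):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two toy modalities measured on very different scales
xa = rng.normal(scale=100.0, size=(999, 54))
xb = rng.normal(scale=0.1, size=(999, 50))

# Standardize each modality separately so that neither dominates
# a joint PCA purely because of its measurement scale
za = (xa - xa.mean(axis=0)) / xa.std(axis=0)
zb = (xb - xb.mean(axis=0)) / xb.std(axis=0)

x = np.hstack([za, zb])
print(x.shape)                        # (999, 104)
print(np.allclose(x.std(axis=0), 1))  # True: all columns on a common scale
```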
More details on the multi-omics methods are provided in the documentation.