Using CellPipeline#

A cell type specific analysis and visualization tool for the gene of interest#

This notebook is built to be run automatically, you can just “Run All” cells. Beware: this requires some patience and high computational resources at the moment.

First, the data and package are loaded. This may take a minute. Set your gene of interest (GOI) here!

[10]:

import sys
sys.path.append('/lustre/groups/ml01/workspace/samantha.bening/Bachelor/')
from importlib import reload
import genereporter.cell_pipeline as cp
reload(cp)

cp = cp.CellPipeline("/lustre/groups/ml01/workspace/samantha.bening/Bachelor/", "data2/veo_ibd_balanced.h5ad")

# set your gene of interest
GOI = "CASP8"
# set your cell type of interest
cell_type = 'CD4 T'

Below is a list of possible coarse (level 1) cell types. Choose one of these as your cell type of interest above (cell_type = ‘[your cell type]’) to run the notebook automatically. Of course, you can rerun certain outputs on different cell types as well.

[2]:

# print cell type names here; easier to select
print(f"Coarse cell types: ")
for cell_type in cp.adata.obs['celltype_l2'].unique():
    print(f"\t{str(cell_type)}")

Coarse cell types:
        Pericyte
        B
        Endothelial
        CD4 T
        CD8 T
        NK_ILC
        Fibroblast
        Cycling B
        Plasma
        Cycling Myeloid
        Cycling Stroma
        Cycling T
        Epithelial
        Glial
        Myeloid
        Tuft
        Smooth Muscle Cell
        pDC
        Mast

[3]:

# UMAP of coarse cell types
cp.plot_umap(color="celltype_l2")

Next, we provide a quick summary of the GOI’s expression class and mean expression level across all cell types.

[4]:

expr_sum = cp.explain_expr_celltypes(GOI='CASP8')
expr_sum

[4]:

	Cell type	Expression class	Avg. expression over cell type
CASP8	pDC	low	0.349
CASP8	CD4 T	very low	0.266
CASP8	Cycling T	very low	0.262
CASP8	CD8 T	very low	0.243
CASP8	NK_ILC	very low	0.240
CASP8	Mast	very low	0.202
CASP8	B	very low	0.140
CASP8	Cycling B	very low	0.122
CASP8	Cycling Myeloid	very low	0.109
CASP8	Plasma	very low	0.109
CASP8	Tuft	very low	0.108
CASP8	Myeloid	very low	0.104
CASP8	Epithelial	very low	0.091
CASP8	Endothelial	very low	0.064
CASP8	Cycling Stroma	very low	0.056
CASP8	Pericyte	very low	0.037
CASP8	Fibroblast	very low	0.037
CASP8	Glial	very low	0.023
CASP8	Smooth Muscle Cell	very low	0.013

[11]:

cp.plot_expressions(GOI, cell_type=cell_type, show_summary=True)
# Can change show_summary=False to hide the textual summary of the expression classes (quantile thresholds and cell counts per category)

Summary for all cells:
Quantile thresholds:
very low: 96.2325, low: 98.8921, middle: 99.4425, high: 99.7479, very high: 99.7500

Number of genes per category:
very_low: 27101
low: 749
middle: 155
high: 86
very_high: 71


Summary for CD4 T cells:
Quantile thresholds:
very low: 96.5912, low: 98.988, middle: 99.4709, high: 99.7479, very high: 99.7500

Number of genes per category:
very_low: 27202
low: 675
middle: 136
high: 78
very_high: 71

Expression vs. Detection visualization#

This can contextualize the expression levels we observe in the standard scanpy plots. In single-cell RNA-seq, only a random sampling of the RNA present in a cell is selected to be sequenced. By pure chance, lowly expressed genes may not be present in all the sampled RNA due to their low prevalance. Here, we can inspect the maximum percentage of expression expected in all genes, specifically our gene of interest.

[12]:

cp.expression_vs_detection(GOI, cell_type=cell_type)
# Can add (or remove) "cell_type=cell_type" to plot only the cell type of interest (or across all cell types)
# todo this section before dotplots etc.

Automatically identify lower outliers (clue to look at celltype subset)#

[13]:

cp.plot_outliers(GOI, outlier_threshold=0.1, cell_type=cell_type)
# Can add "cell_type=cell_type" to plot only the cell type of interest

This is how the maximum threshold curve approximation is calculated. This is primarily interesting for our fundamental understanding of the curve’s approximation through the spline’s 3rd derivative’s change points and the linear approximation of this curve.

[14]:

cp.fit_spline(plot=True, cell_type=cell_type)

These are the top 5 number of outliers, sorted by their distance away from the maximum curve. You can show more or less by changing the head=n parameter.

[15]:

cp.list_outliers(cell_type=cell_type)

# can show top n number of genes by adding "head=n"

[15]:

	log1p(means)	percent_detected	distance	is_outlier
HSPA1A	1.041783	0.459530	0.394890	True
HSPA1B	0.911569	0.453423	0.339097	True
IGKC	0.581478	0.103288	0.320663	True
KLF2	0.865623	0.474351	0.299710	True
CCL4	0.509195	0.062818	0.299202	True

GOI expression across cell types#

Now we show the standard scanpy plots of our GOI’s expression across both coarse cell types and fine cell types. The fine cell type automatically shown in the one you set at the beginning of this notebook. You can rerun the cell with other cell types of interest by setting the cell_type=[‘your cell type’] parameter.

[23]:

# GOI expression across coarse cell types
cp.dotplot(GOI)

[17]:

# GOI expression in fine cell type
cp.dotplot(GOI, cell_type=cell_type)

[21]:

# GOI expression across coarse cell types
# This is similar to the coarse cell type dotplot previously, just a different visualization
cp.matrixplot(GOI)

[19]:

# GOI expression across coarse cell types
# Individual vertical "lines" correspond to individual cells
# A more fine grained visual than the mean expression plots shown before
cp.heatmap(GOI)