Cell Type Specific Analysis

CellPipeline is a pipeline for cell-type-specific analysis of a gene of interest (GOI), both across all cells and within specific cell types. See Using CellPipeline for a detailed usage example.

class cell_pipeline.CellPipeline(wdir: str, data_file: str)

__init__(wdir: str, data_file: str)

Initialize the CellPipeline class. Load the AnnData file and set the working directory. The AnnData given here will be used for all subsequent analyses.

Parameters:

wdir (str) – The working directory.
data_file (str) – The path to the AnnData file to load.

calc_distance(x: Series, y: Series, cell_type: str = None) → List[float]

This function calculates the orthogonal distance of each point in the given x and y series from the spline fitted to the data.

Parameters:

x (pandas.Series) – The x-coordinates of the points.
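y (pandas.Series) – The y-coordinates of the points.
cell_type (str, optional) – The cell type whose fitted spline the distances are measured against. If None, the spline fitted across all cell types is used. Default is None.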

Returns:

A list of the orthogonal distances of each point from the spline.

Return type:

List[float]
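
The idea of measuring each point against a fitted curve can be illustrated with a minimal sketch. It assumes the curve is available as a list of sampled (x, y) points and approximates it by straight segments. The names point_segment_distance and distances_to_curve are hypothetical and are not part of CellPipeline, whose actual implementation may differ.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def point_segment_distance(point: Point, a: Point, b: Point) -> float:
    """Shortest distance from `point` to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = point, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: both ends coincide.
        return math.hypot(px - ax, py - ay)
    # Project the point onto the segment and clamp to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distances_to_curve(
    xs: Sequence[float], ys: Sequence[float], curve: Sequence[Point]
) -> List[float]:
    """Distance of every (x, y) pair to a curve sampled as a polyline."""
    segments = list(zip(curve[:-1], curve[1:]))
    return [
        min(point_segment_distance((x, y), a, b) for a, b in segments)
        for x, y in zip(xs, ys)
    ]


# Hypothetical sampled curve and data points.
curve = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
dists = distances_to_curve([0.0, 1.0, 2.0], [0.0, 2.0, 1.0], curve)
```

The denser the sampling of the curve, the closer these polyline distances come to the true orthogonal distances.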

calc_distance_point(point: Tuple[float, float], p1: Tuple[float, float], p2: Tuple[float, float]) → float

This function calculates the orthogonal distance of a point from the line defined by two other points.

Parameters:

point (Tuple[float, float]) – The point for which the distance is to be calculated.
p1 (Tuple[float, float]) – The first point defining the line.
p2 (Tuple[float, float]) – The second point defining the line.

Returns:

The orthogonal distance of the point from the line.

Return type:

float
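
The geometry described above is the standard point-to-line distance: the absolute cross product of (p2 - p1) and (point - p1), divided by the length of (p2 - p1). The following is a minimal sketch of that formula, not necessarily the code used by CellPipeline. The function name point_line_distance and the handling of coinciding p1 and p2 are assumptions.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def point_line_distance(point: Point, p1: Point, p2: Point) -> float:
    """Perpendicular distance from `point` to the infinite line through p1 and p2."""
    (x0, y0), (x1, y1), (x2, y2) = point, p1, p2
    dx, dy = x2 - x1, y2 - y1
    length = math.hypot(dx, dy)
    if length == 0.0:
        # p1 and p2 coincide, so no line is defined; fall back to point distance.
        return math.hypot(x0 - x1, y0 - y1)
    # |cross product| / base length = height of the parallelogram.
    return abs(dx * (y0 - y1) - dy * (x0 - x1)) / length


d = point_line_distance((3.0, 4.0), (0.0, 0.0), (1.0, 0.0))
```

Here the line is the x-axis, so d equals the y-coordinate of the point.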

classify_exp_level(df, filtered, col) → Tuple[DataFrame, str]

This function classifies the expression level of genes into five categories: very low, low, middle, high, very high. It also generates a summary of the quantile thresholds and the number of genes in each category.

Parameters:

df (pandas.DataFrame) – The input DataFrame containing the gene expression data.
filtered (list or numpy.ndarray) – The filtered gene expression data for which to calculate the thresholds.
col (str) – The column in df containing the gene expression data.

Returns:

The DataFrame with an additional column for the expression level category, and the summary string.

Return type:

tuple(pandas.DataFrame, str)
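
Assigning five categories from four ascending cut points can be sketched with a binary search. The cut points, gene names, and the rule that a value equal to a cut point falls into the upper category are all assumptions made for illustration. The thresholds CellPipeline actually derives from the data are not reproduced here.

```python
from bisect import bisect_right
from collections import Counter
from typing import Sequence

LABELS = ["very low", "low", "middle", "high", "very high"]


def classify(value: float, cuts: Sequence[float]) -> str:
    """Map a value to one of five labels using four ascending cut points."""
    # bisect_right counts how many cut points are <= value.
    return LABELS[bisect_right(cuts, value)]


# Hypothetical cut points and per-gene expression values.
cuts = [0.5, 1.0, 2.0, 4.0]
expression = {"GENE_A": 0.1, "GENE_B": 0.5, "GENE_C": 3.0, "GENE_D": 10.0}

levels = {gene: classify(value, cuts) for gene, value in expression.items()}
summary = Counter(levels.values())  # number of genes per category
```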

clean_data(df: DataFrame, col: str, threshold=99.75) → ndarray

This function cleans the data by removing outliers, that is, values above the given percentile threshold.

Parameters:

df (pandas.DataFrame) – The DataFrame containing the data to be cleaned.
col (str) – The column in the DataFrame to be cleaned.
threshold (float, optional) – The percentile above which data points are considered outliers. Default is 99.75.

Returns:

The cleaned data as a 1D numpy array.

Return type:

numpy.ndarray
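
Percentile-based outlier removal can be sketched as follows. This version uses only the standard library and plain lists, so it does not reproduce the DataFrame input or numpy.ndarray output of clean_data. The helper names percentile and drop_upper_outliers are hypothetical, and the linear interpolation between order statistics is an assumption.

```python
import math
from typing import List, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0-100), interpolating linearly between sorted values."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of empty data is undefined")
    rank = (len(ordered) - 1) * q / 100.0
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def drop_upper_outliers(values: Sequence[float], threshold: float = 99.75) -> List[float]:
    """Keep only the values at or below the `threshold` percentile."""
    cutoff = percentile(values, threshold)
    return [v for v in values if v <= cutoff]


# 100 ordinary values plus one extreme value.
data = [float(v) for v in range(1, 101)] + [10_000.0]
cleaned = drop_upper_outliers(data)
```

With the default threshold, only the single extreme value lies above the cutoff and is dropped.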

detect_outliers(cell_type: str = None, outlier_threshold: float = 0.15) → DataFrame

This function detects outliers in the expression versus detection data of a given cell type. It calculates the orthogonal distance of each point from the spline fitted to the data, and marks points with a distance greater than the specified outlier threshold as outliers.

Parameters:

cell_type (str, optional) – The specific cell type to analyze. If None, all cell types are used. Default is None.
outlier_threshold (float, optional) – The threshold for marking a point as an outlier. If the orthogonal distance of a point from the spline is greater than this threshold, it is marked as an outlier. Default is 0.15.

Returns:

A DataFrame with the mean expression, percentage detection, orthogonal distance from the spline, and outlier status of each gene for the given cell type.

Return type:

pandas.DataFrame
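
The thresholding step can be sketched in isolation. The sketch assumes the distances have already been computed and uses plain dictionaries instead of a DataFrame. The function name flag_outliers, the gene names, and the distance values are hypothetical.

```python
from typing import Dict, List


def flag_outliers(distances: Dict[str, float], outlier_threshold: float = 0.15) -> List[dict]:
    """Return one row per gene with its distance and a boolean outlier flag."""
    return [
        # Strictly greater than the threshold, matching the description above.
        {"gene": gene, "distance": dist, "outlier": dist > outlier_threshold}
        for gene, dist in distances.items()
    ]


# Hypothetical distances from the fitted curve.
rows = flag_outliers({"GENE_A": 0.02, "GENE_B": 0.15, "GENE_C": 0.40})
```

A gene whose distance equals the threshold exactly, such as GENE_B here, is not flagged.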

dotplot(GOI: str, cell_type: str = None, **kwargs) → None

This function creates a dot plot of the expression of a given gene of interest (GOI) across all cell types or a specific cell type.

Parameters:

GOI (str) – The gene of interest.
cell_type (str, optional) – The specific cell type to plot. If None, all cell types are plotted. Default is None.
kwargs – Additional keyword arguments to be passed to the sc.pl.dotplot function.

Returns:

None

explain_expr_celltypes(GOI: str, col: str = 'log1p(means)', layer='log_norm') → DataFrame

This function summarizes the expression of a given gene of interest (GOI) in each cell type. It creates a DataFrame for each cell type, calculates the expression level of the GOI, and concatenates the results. The resulting DataFrame is then formatted for output.

Parameters:

GOI (str) – The gene of interest.
col (str, optional) – The column in the DataFrame to be used for the expression level calculation. Default is ‘log1p(means)’.
layer (str, optional) – The layer of the AnnData object to be used for calculating highly variable genes. Default is ‘log_norm’.

Returns:

The DataFrame with the expression level of the GOI for each cell type.

Return type:

pandas.DataFrame

expression_vs_detection(GOI: str, adata=None, cell_type: str = None, col: str = 'log1p(means)', return_df: bool = False) → None | DataFrame

This function creates a DataFrame of the mean expression of each gene and the percentage of cells in which it is detected, either across all cell types or for a specific cell type. If return_df is True, the DataFrame is returned. Otherwise, the function plots the mean expression of the given gene of interest (GOI) against the percentage of cells in which it is detected, and returns None.

Parameters:

GOI (str) – The gene of interest.
adata (anndata.AnnData, optional) – The input AnnData object containing the gene expression data. If None, the AnnData object loaded by the pipeline is used. Default is None.
cell_type (str, optional) – The specific cell type to plot. If None, all cell types are plotted. Default is None.
col (str, optional) – The column in the DataFrame to be used for the x-axis. Default is ‘log1p(means)’.
return_df (bool, optional) – Whether to return the DataFrame. If False, the function will plot the data and return None. Default is False.

Returns:

If return_df is True, the DataFrame with the mean expression and percentage detection of the GOI for each cell type. Otherwise, None.

Return type:

Union[None, pandas.DataFrame]
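
The two per-gene quantities can be sketched on a toy input. The sketch assumes that a gene counts as detected in a cell when its value is greater than zero, and that log1p(means) is the natural log of one plus the mean. Both are assumptions about the method. The function name summarize_genes and the key pct_detected are hypothetical.

```python
import math
from typing import Dict, List


def summarize_genes(expression: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Per gene: log1p of the mean expression and percentage of cells with value > 0."""
    table = {}
    for gene, values in expression.items():
        n_cells = len(values)
        mean = sum(values) / n_cells
        detected = sum(1 for v in values if v > 0)
        table[gene] = {
            "log1p(means)": math.log1p(mean),
            "pct_detected": 100.0 * detected / n_cells,
        }
    return table


# Hypothetical expression values: one list entry per cell.
table = summarize_genes({
    "GENE_A": [0.0, 0.0, 2.0, 2.0],
    "GENE_B": [1.0, 1.0, 1.0, 1.0],
})
```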

find_thresholds(filtered)

This function calculates and returns four thresholds (very low, low, middle, high) based on the range of the input data. The very_high threshold is set to the 99.75th percentile of the input data.

Parameters:

filtered (list or numpy.ndarray) – The input data for which to calculate the thresholds.

Returns:

The calculated thresholds (very low, low, middle, high).

Return type:

tuple

fit_spline(cell_type: str = None, plot: bool = False) → None | Tuple[ndarray, ndarray]

This function fits a spline to the expression versus detection data of a given cell type, and optionally plots the data and the fitted spline.

Parameters:

cell_type (str, optional) – The specific cell type to fit. If None, all cell types are used. Default is None.
plot (bool, optional) – Whether to plot the data and the fitted spline. If False, the function will return the spline parameters and inflection points. Default is False.

Returns:

If plot is False, a tuple containing the spline parameters and the inflection points. Otherwise, None.

Return type:

Union[None, Tuple[numpy.ndarray, numpy.ndarray]]

get_adata() → AnnData

This function returns the AnnData object loaded in this module.

Returns:

The AnnData object loaded in this module.

Return type:

anndata.AnnData

heatmap(GOI: str, layer: str = None) → None

This function creates a heatmap of the expression of a given gene of interest (GOI) across all cell types.

Parameters:

GOI (str) – The gene of interest.
layer (str, optional) – The layer of the AnnData object to be used for calculating highly variable genes. Default is None.

Returns:

None

list_outliers(cell_type: str = None, head: int = 5) → DataFrame

This function lists the top outliers in the expression versus detection data for a given cell type. It sorts the data by the orthogonal distance from the spline, in descending order, and returns the top ‘head’ number of outliers.

Parameters:

cell_type (str, optional) – The specific cell type to analyze. If None, all cell types are used. Default is None.
head (int, optional) – The number of top outliers to return. Default is 5.

Returns:

A DataFrame with the top ‘head’ number of outliers, sorted by the orthogonal distance from the spline in descending order.

Return type:

pandas.DataFrame
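
The ranking step can be sketched as a filter, a descending sort, and a slice. The sketch assumes rows that already carry a distance and an outlier flag, and it assumes that rows not flagged as outliers are excluded before ranking. The function name top_outliers and the sample rows are hypothetical.

```python
from typing import List


def top_outliers(rows: List[dict], head: int = 5) -> List[dict]:
    """Return up to `head` flagged rows, largest distance first."""
    outliers = [row for row in rows if row["outlier"]]
    return sorted(outliers, key=lambda row: row["distance"], reverse=True)[:head]


# Hypothetical rows with precomputed distances and outlier flags.
rows = [
    {"gene": "GENE_A", "distance": 0.02, "outlier": False},
    {"gene": "GENE_B", "distance": 0.31, "outlier": True},
    {"gene": "GENE_C", "distance": 0.40, "outlier": True},
    {"gene": "GENE_D", "distance": 0.18, "outlier": True},
]
top = top_outliers(rows, head=2)
```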

make_df(adata, threshold: float = 99.75, col: str = 'log1p(means)') → Tuple[DataFrame, str]

This function creates a DataFrame from the given AnnData object, where the genes’ expression level is calculated and classified into five categories: very low, low, middle, high, very high. It also generates a summary of the quantile thresholds and the number of genes in each category.

Parameters:

adata (anndata.AnnData) – The input AnnData object containing the gene expression data.
threshold (float, optional) – The percentile above which data points are considered outliers (i.e. very_high class). Default is 99.75.
col (str, optional) – The column in the DataFrame to be used for the expression level classification. Default is ‘log1p(means)’.

Returns:

The DataFrame with additional columns for the gene number and expression level category, and the summary string.

Return type:

tuple(pandas.DataFrame, str)

matrixplot(GOI: str, cell_type: str = None) → None

This function creates a matrix plot of the expression of a given gene of interest (GOI) across all cell types or a specific cell type.

Parameters:

GOI (str) – The gene of interest.
cell_type (str, optional) – The specific cell type to plot. If None, all cell types are plotted. Default is None.

Returns:

None

plot_expr_class(GOI: str, ax, adata, cell_type: str = None, col: str = 'log1p(means)') → Tuple[DataFrame, str]

This function plots the expression class of a given gene of interest (GOI) across all cell types or a specific cell type. It creates a DataFrame, highlights the GOI on the plot, and annotates it with its expression class.

Parameters:

GOI (str) – The gene of interest.
ax (matplotlib.axes.Axes) – The axes object to draw the plot onto.
adata (anndata.AnnData) – The input AnnData object containing the gene expression data.
cell_type (str, optional) – The specific cell type to plot. If None, all cell types are plotted. Default is None.
col (str, optional) – The column in the DataFrame to be used for the y-axis. Default is ‘log1p(means)’.

Returns:

The DataFrame with the expression level of the GOI for each cell type, and the summary string.

Return type:

tuple(pandas.DataFrame, str)

plot_expressions(GOI: str, cell_type: str = 'CD4 T', show_summary: bool = False) → None

This function plots the expression class of a given gene of interest (GOI) across all cell types and a specific cell type. It creates two subplots, one for all cell types and one for the specific cell type, and optionally prints a summary of the expression class for each plot.

Parameters:

GOI (str) – The gene of interest.
cell_type (str, optional) – The specific cell type to plot. Default is ‘CD4 T’.
show_summary (bool, optional) – Whether to print a summary of the expression class for each plot. Default is False.

Returns:

None

plot_outliers(GOI: str, cell_type: str = None, outlier_threshold: float = 0.15) → None

This function plots the expression versus detection data for a given cell type, with outliers highlighted. It also highlights a gene of interest (GOI) in the plot.

Parameters:

GOI (str) – The gene of interest to be highlighted in the plot.
cell_type (str, optional) – The specific cell type to plot. If None, all cell types are plotted. Default is None.
outlier_threshold (float, optional) – The threshold for marking a point as an outlier. If the orthogonal distance of a point from the spline is greater than this threshold, it is marked as an outlier. Default is 0.15.

Returns:

None

plot_umap(color: str = 'celltype_l2') → None

This function plots a UMAP of the AnnData object, with cells colored by the annotation named in color.

Parameters:

color (str, optional) – The annotation in the AnnData object used to color the cells. Default is “celltype_l2”.

Returns:

None