Cell Type Specific Analysis

CellPipeline is a pipeline for cell-type-specific analysis of a gene of interest (GOI), both across all cells and within specific cell types. See Using CellPipeline for a detailed usage example.

class cell_pipeline.CellPipeline(wdir: str, data_file: str)
__init__(wdir: str, data_file: str)

Initialize the CellPipeline class. Load the AnnData file and set the working directory. The AnnData given here will be used for all subsequent analyses.

Parameters:
  • wdir (str) – The working directory.

  • data_file (str) – The path to the AnnData file to load.

calc_distance(x: Series, y: Series, cell_type: str = None) → List[float]

This function calculates the orthogonal distance of each point in the given x and y series from the spline fitted to the data.

Parameters:
  • x (pandas.Series) – The x-coordinates of the points.

  • y (pandas.Series) – The y-coordinates of the points.

  • cell_type (str, optional) – The specific cell type whose data the spline is fitted to. If None, all cell types are used. Default is None.

Returns:

A list of the orthogonal distances of each point from the spline.

Return type:

List[float]
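
To illustrate the idea of an orthogonal distance from a fitted curve, the sketch below samples a curve into a dense polyline and takes the shortest distance from each point to any segment. It is not CellPipeline's implementation: `curve_distances`, `segment_distance`, `n_samples`, and the `curve` callable (standing in for a fitted spline) are hypothetical names, and plain Python sequences replace pandas Series so the example runs without dependencies.

```python
import math
from typing import Callable, List, Sequence


def segment_distance(px: float, py: float,
                     ax: float, ay: float,
                     bx: float, by: float) -> float:
    """Distance from point P to the closest point on segment AB."""
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: A and B coincide.
        return math.hypot(px - ax, py - ay)
    # Project P onto AB and clamp the projection to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def curve_distances(xs: Sequence[float], ys: Sequence[float],
                    curve: Callable[[float], float],
                    n_samples: int = 200) -> List[float]:
    """Approximate orthogonal distance of each (x, y) from y = curve(x)."""
    lo, hi = min(xs), max(xs)
    # Sample the curve on an even grid spanning the observed x-range.
    grid = [lo + (hi - lo) * i / (n_samples - 1) for i in range(n_samples)]
    poly = [(g, curve(g)) for g in grid]
    segments = list(zip(poly, poly[1:]))
    return [
        min(segment_distance(px, py, ax, ay, bx, by)
            for (ax, ay), (bx, by) in segments)
        for px, py in zip(xs, ys)
    ]


# Hypothetical usage: distances of three points from the flat curve y = 0.
distances = curve_distances([0.0, 1.0, 2.0], [0.5, -0.25, 0.0], lambda x: 0.0)
```

The accuracy of this approximation depends on `n_samples`, and the curve is only sampled over the observed x-range, so points beyond its ends are measured to the nearest endpoint.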

calc_distance_point(point: Tuple[float, float], p1: Tuple[float, float], p2: Tuple[float, float]) → float

This function calculates the orthogonal distance of a point from the line defined by two other points.

Parameters:
  • point (Tuple[float, float]) – The point for which the distance is to be calculated.

  • p1 (Tuple[float, float]) – The first point defining the line.

  • p2 (Tuple[float, float]) – The second point defining the line.

Returns:

The orthogonal distance of the point from the line.

Return type:

float
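
The distance described here matches the standard point-to-line formula, which can be sketched as follows. The function name `point_line_distance` is hypothetical, and the fallback for coincident `p1` and `p2` is an assumption; the source does not say how CellPipeline treats that case.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def point_line_distance(point: Point, p1: Point, p2: Point) -> float:
    """Perpendicular distance from `point` to the infinite line through p1 and p2."""
    (x0, y0), (x1, y1), (x2, y2) = point, p1, p2
    dx, dy = x2 - x1, y2 - y1
    length = math.hypot(dx, dy)
    if length == 0.0:
        # Assumption: if p1 and p2 coincide, fall back to point-to-point distance.
        return math.hypot(x0 - x1, y0 - y1)
    # |cross product of (p2 - p1) and (point - p1)| divided by |p2 - p1|.
    return abs(dx * (y0 - y1) - dy * (x0 - x1)) / length


# Hypothetical usage: distance of (0, 1) from the x-axis.
print(point_line_distance((0.0, 1.0), (0.0, 0.0), (1.0, 0.0)))
```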

classify_exp_level(df, filtered, col) → Tuple[DataFrame, str]

This function classifies the expression level of genes into five categories: very low, low, middle, high, very high. It also generates a summary of the quantile thresholds and the number of genes in each category.

Parameters:
  • df (pandas.DataFrame) – The input DataFrame containing the gene expression data.

  • filtered (list or numpy.ndarray) – The filtered gene expression data for which to calculate the thresholds.

  • col (str) – The column in df containing the gene expression data.

Returns:

The DataFrame with an additional column for the expression level category, and the summary string.

Return type:

tuple(pandas.DataFrame, str)
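
The five-way classification can be sketched with four ascending cutoffs and a binary search. This is only an illustration under stated assumptions: `classify` and `summarize` are hypothetical helpers, the cutoffs are passed in rather than derived from the data, and a value equal to a cutoff is assigned to the lower class, which the source does not specify.

```python
from bisect import bisect_left
from collections import Counter
from typing import Dict, List, Sequence

LABELS = ["very low", "low", "middle", "high", "very high"]


def classify(values: Sequence[float], cutoffs: Sequence[float]) -> List[str]:
    """Label each value using four ascending cutoffs (upper bounds of the first four classes)."""
    if len(cutoffs) != 4:
        raise ValueError("expected exactly four cutoffs")
    # bisect_left counts the cutoffs strictly below the value,
    # which is the index of the class the value falls into.
    return [LABELS[bisect_left(cutoffs, v)] for v in values]


def summarize(labels: Sequence[str]) -> Dict[str, int]:
    """Number of values in each class, in class order."""
    counts = Counter(labels)
    return {label: counts.get(label, 0) for label in LABELS}


# Hypothetical usage with made-up cutoffs.
labels = classify([0.5, 1.0, 2.5, 4.0, 10.0], [1.0, 2.0, 3.0, 4.0])
print(summarize(labels))
```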

clean_data(df: DataFrame, col: str, threshold=99.75) → ndarray

This function cleans the data by removing outliers, i.e. values in the given column that lie above the threshold percentile.

Parameters:
  • df (pandas.DataFrame) – The DataFrame containing the data to be cleaned.

  • col (str) – The column in the DataFrame to be cleaned.

  • threshold (float, optional) – The percentile above which data points are considered outliers. Default is 99.75.

Returns:

The cleaned data as a 1D numpy array.

Return type:

numpy.ndarray
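
A percentile-based cleanup of this kind can be sketched as below. The helpers `percentile` and `drop_upper_outliers` are hypothetical, the percentile uses linear interpolation between order statistics, and keeping values equal to the cutoff is an assumption; the source does not state either detail. Plain lists are used instead of a DataFrame column so the example runs on its own.

```python
import math
from typing import List, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0-100) using linear interpolation between order statistics."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty sequence is undefined")
    rank = (len(ordered) - 1) * q / 100.0
    lower, upper = math.floor(rank), math.ceil(rank)
    frac = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def drop_upper_outliers(values: Sequence[float], threshold: float = 99.75) -> List[float]:
    """Keep only the values at or below the given percentile."""
    cutoff = percentile(values, threshold)
    return [v for v in values if v <= cutoff]


# Hypothetical usage: the three largest of 1000 evenly spaced values are dropped.
cleaned = drop_upper_outliers([float(v) for v in range(1, 1001)])
```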

detect_outliers(cell_type: str = None, outlier_threshold: float = 0.15) → DataFrame

This function detects outliers in the expression versus detection data of a given cell type. It calculates the orthogonal distance of each point from the spline fitted to the data, and marks points with a distance greater than the specified outlier threshold as outliers.

Parameters:
  • cell_type (str, optional) – The specific cell type to analyze. If None, all cell types are used. Default is None.

  • outlier_threshold (float, optional) – The threshold for marking a point as an outlier. If the orthogonal distance of a point from the spline is greater than this threshold, it is marked as an outlier. Default is 0.15.

Returns:

A DataFrame with the mean expression, percentage detection, orthogonal distance from the spline, and outlier status of each gene for the given cell type.

Return type:

pandas.DataFrame
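
The thresholding rule described above (a point is an outlier when its distance is strictly greater than the threshold) can be sketched as follows. `flag_outliers` and the gene names are hypothetical, and a plain dictionary stands in for the DataFrame.

```python
from typing import Dict


def flag_outliers(distances: Dict[str, float],
                  outlier_threshold: float = 0.15) -> Dict[str, bool]:
    """Mark each gene as an outlier if its distance exceeds the threshold."""
    return {gene: dist > outlier_threshold for gene, dist in distances.items()}


# Hypothetical usage with made-up distances.
flags = flag_outliers({"GENE_A": 0.02, "GENE_B": 0.15, "GENE_C": 0.40})
```

With a strict comparison, a gene whose distance equals the threshold exactly (GENE_B here) is not flagged.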

dotplot(GOI: str, cell_type: str = None, **kwargs) → None

This function creates a dot plot of the expression of a given gene of interest (GOI) across all cell types or a specific cell type.

Parameters:
  • GOI (str) – The gene of interest.

  • cell_type (str, optional) – The specific cell type to plot. If None, all cell types are plotted. Default is None.

  • kwargs – Additional keyword arguments passed to the sc.pl.dotplot function.

Returns:

None

explain_expr_celltypes(GOI: str, col: str = 'log1p(means)', layer='log_norm') → DataFrame

This function summarizes the expression of a given gene of interest (GOI) in each cell type. It creates a DataFrame for each cell type, calculates the expression level of the GOI, and concatenates the results. The resulting DataFrame is then formatted for output.

Parameters:
  • GOI (str) – The gene of interest.

  • col (str, optional) – The column in the DataFrame to be used for the expression level calculation. Default is ‘log1p(means)’.

  • layer (str, optional) – The layer of the AnnData object to be used for calculating highly variable genes. Default is ‘log_norm’.

Returns:

The DataFrame with the expression level of the GOI for each cell type.

Return type:

pandas.DataFrame

expression_vs_detection(GOI: str, adata=None, cell_type: str = None, col: str = 'log1p(means)', return_df: bool = False) → None | DataFrame

This function creates a DataFrame, calculates the percentage of cells where each gene is detected, and optionally returns the DataFrame. If return_df=False, the function plots the mean expression of a given gene of interest (GOI) versus the percentage of cells where this gene is detected, either across all cell types or a specific cell type.

Parameters:
  • GOI (str) – The gene of interest.

  • adata (anndata.AnnData, optional) – The input AnnData object containing the gene expression data. Default is the global adata object.

  • cell_type (str, optional) – The specific cell type to plot. If None, all cell types are plotted. Default is None.

  • col (str, optional) – The column in the DataFrame to be used for the x-axis. Default is ‘log1p(means)’.

  • return_df (bool, optional) – Whether to return the DataFrame. If False, the function will plot the data and return None. Default is False.

Returns:

If return_df is True, the DataFrame with the mean expression and percentage detection of the GOI for each cell type. Otherwise, None.

Return type:

Union[None, pandas.DataFrame]
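
The two per-gene quantities involved, mean expression and percentage of cells in which a gene is detected, can be sketched as below. This is an illustration, not the library's code: `mean_vs_detection` and the `pct_detected` key are hypothetical, a gene is assumed to be "detected" in a cell when its value is greater than zero, and ‘log1p(means)’ is assumed to mean the natural log of one plus the mean.

```python
import math
from typing import Dict, List


def mean_vs_detection(expression: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Per-gene log1p of the mean expression and percentage of cells with a nonzero value.

    `expression` maps each gene to its values across cells (non-empty lists).
    """
    table = {}
    for gene, values in expression.items():
        mean = sum(values) / len(values)
        # Assumption: a gene counts as detected in a cell if its value is > 0.
        detected = sum(1 for v in values if v > 0)
        table[gene] = {
            "log1p(means)": math.log1p(mean),
            "pct_detected": 100.0 * detected / len(values),
        }
    return table


# Hypothetical usage with two made-up genes measured in four cells.
table = mean_vs_detection({"GENE_A": [0, 0, 2, 2], "GENE_B": [0, 0, 0, 0]})
```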

find_thresholds(filtered)

This function calculates and returns four thresholds (very low, low, middle, high) based on the range of the input data. The very_high threshold is set to the 99.75th percentile of the input data.

Parameters:

filtered (list or numpy.ndarray) – The input data for which to calculate the thresholds.

Returns:

The calculated thresholds (very low, low, middle, high).

Return type:

tuple
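
One way to derive four thresholds from the range of the data is to split the interval between the minimum and maximum into four equal-width bins. The sketch below does that; the equal-width split is an assumption, since the source only says the thresholds are "based on the range", and `range_thresholds` is a hypothetical name.

```python
from typing import Sequence, Tuple


def range_thresholds(filtered: Sequence[float]) -> Tuple[float, float, float, float]:
    """Upper bounds of four equal-width bins spanning [min, max] of the data."""
    lo, hi = min(filtered), max(filtered)
    step = (hi - lo) / 4
    very_low, low, middle, high = (lo + step * k for k in range(1, 5))
    return very_low, low, middle, high


# Hypothetical usage on already-cleaned data.
thresholds = range_thresholds([0.0, 1.0, 8.0])
```

If the input has already had values above the 99.75th percentile removed, the last bound coincides with the largest remaining value, so anything above it would fall into the very high class.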

fit_spline(cell_type: str = None, plot: bool = False) → None | Tuple[ndarray, ndarray]

This function fits a spline to the expression versus detection data of a given cell type, and optionally plots the data and the fitted spline.

Parameters:
  • cell_type (str, optional) – The specific cell type to fit. If None, all cell types are used. Default is None.

  • plot (bool, optional) – Whether to plot the data and the fitted spline. If False, the function returns the spline parameters and inflection points instead. Default is False.

Returns:

If plot is False, a tuple containing the spline parameters and the inflection points. Otherwise, None.

Return type:

Union[None, Tuple[numpy.ndarray, numpy.ndarray]]

get_adata() → AnnData

This function returns the AnnData object loaded in this module.

Returns:

The AnnData object loaded in this module.

Return type:

anndata.AnnData

heatmap(GOI: str, layer: str = None) → None

This function creates a heatmap of the expression of a given gene of interest (GOI) across all cell types.

Parameters:
  • GOI (str) – The gene of interest.

  • layer (str, optional) – The layer of the AnnData object to be used for calculating highly variable genes. Default is None.

Returns:

None

list_outliers(cell_type: str = None, head: int = 5) → DataFrame

This function lists the top outliers in the expression versus detection data for a given cell type. It sorts the data by the orthogonal distance from the spline, in descending order, and returns the top ‘head’ number of outliers.

Parameters:
  • cell_type (str, optional) – The specific cell type to analyze. If None, all cell types are used. Default is None.

  • head (int, optional) – The number of top outliers to return. Default is 5.

Returns:

A DataFrame with the top ‘head’ number of outliers, sorted by the orthogonal distance from the spline in descending order.

Return type:

pandas.DataFrame
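
The ranking step can be sketched as a descending sort followed by a truncation to `head` entries. `top_outliers` and the gene names are hypothetical, and a dictionary of distances stands in for the DataFrame.

```python
from typing import Dict, List, Tuple


def top_outliers(distances: Dict[str, float], head: int = 5) -> List[Tuple[str, float]]:
    """Return the `head` genes with the largest distance, largest first."""
    ranked = sorted(distances.items(), key=lambda item: item[1], reverse=True)
    return ranked[:head]


# Hypothetical usage with made-up distances.
top = top_outliers({"GENE_A": 0.02, "GENE_B": 0.31, "GENE_C": 0.40}, head=2)
```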

make_df(adata, threshold: float = 99.75, col: str = 'log1p(means)') → Tuple[DataFrame, str]

This function creates a DataFrame from the given AnnData object, where the genes’ expression level is calculated and classified into five categories: very low, low, middle, high, very high. It also generates a summary of the quantile thresholds and the number of genes in each category.

Parameters:
  • adata (anndata.AnnData) – The input AnnData object containing the gene expression data.

  • threshold (float, optional) – The percentile above which data points are considered outliers (i.e. very_high class). Default is 99.75.

  • col (str, optional) – The column in the DataFrame to be used for the expression level classification. Default is ‘log1p(means)’.

Returns:

The DataFrame with additional columns for the gene number and expression level category, and the summary string.

Return type:

tuple(pandas.DataFrame, str)

matrixplot(GOI: str, cell_type: str = None) → None

This function creates a matrix plot of the expression of a given gene of interest (GOI) across all cell types or a specific cell type.

Parameters:
  • GOI (str) – The gene of interest.

  • cell_type (str, optional) – The specific cell type to plot. If None, all cell types are plotted. Default is None.

Returns:

None

plot_expr_class(GOI: str, ax, adata, cell_type: str = None, col: str = 'log1p(means)') → Tuple[DataFrame, str]

This function plots the expression class of a given gene of interest (GOI) across all cell types or a specific cell type. It creates a DataFrame, highlights the GOI on the plot, and annotates it with its expression class.

Parameters:
  • GOI (str) – The gene of interest.

  • ax (matplotlib.axes.Axes) – The axes object to draw the plot onto.

  • adata (anndata.AnnData) – The input AnnData object containing the gene expression data.

  • cell_type (str, optional) – The specific cell type to plot. If None, all cell types are plotted. Default is None.

  • col (str, optional) – The column in the DataFrame to be used for the y-axis. Default is ‘log1p(means)’.

Returns:

The DataFrame with the expression level of the GOI for each cell type, and the summary string.

Return type:

tuple(pandas.DataFrame, str)

plot_expressions(GOI: str, cell_type: str = 'CD4 T', show_summary: bool = False) → None

This function plots the expression class of a given gene of interest (GOI) across all cell types and a specific cell type. It creates two subplots, one for all cell types and one for the specific cell type, and optionally prints a summary of the expression class for each plot.

Parameters:
  • GOI (str) – The gene of interest.

  • cell_type (str, optional) – The specific cell type to plot. Default is ‘CD4 T’.

  • show_summary (bool, optional) – Whether to print a summary of the expression class for each plot. Default is False.

Returns:

None

plot_outliers(GOI: str, cell_type: str = None, outlier_threshold: float = 0.15) → None

This function plots the expression versus detection data for a given cell type, with outliers highlighted. It also highlights a gene of interest (GOI) in the plot.

Parameters:
  • GOI (str) – The gene of interest to be highlighted in the plot.

  • cell_type (str, optional) – The specific cell type to plot. If None, all cell types are plotted. Default is None.

  • outlier_threshold (float, optional) – The threshold for marking a point as an outlier. If the orthogonal distance of a point from the spline is greater than this threshold, it is marked as an outlier. Default is 0.15.

Returns:

None

plot_umap(color: str = 'celltype_l2') → None

This function plots a UMAP of the given AnnData object, grouping by coarse cell type in the adata object.

Parameters:

color (str, optional) – The cell annotation to color the plot by. Default is “celltype_l2”.