Cell Type Specific Analysis
CellPipeline is a pipeline for cell-type-specific analysis of a gene of interest (GOI), either across all cells or within a specific cell type. See Using CellPipeline for a detailed usage example.
- class cell_pipeline.CellPipeline(wdir: str, data_file: str)
- __init__(wdir: str, data_file: str)
Initialize the CellPipeline class. Load the AnnData file and set the working directory. The AnnData given here will be used for all subsequent analyses.
- Parameters:
wdir (str) – The working directory.
data_file (str) – The path to the AnnData file to load.
- calc_distance(x: Series, y: Series, cell_type: str = None) → List[float]
This function calculates the orthogonal distance of each point in the given x and y series from the spline fitted to the data.
- Parameters:
x (pandas.Series) – The x-coordinates of the points.
y (pandas.Series) – The y-coordinates of the points.
- Returns:
A list of the orthogonal distances of each point from the spline.
- Return type:
List[float]
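The idea behind calc_distance can be sketched without any plotting or AnnData machinery. The block below is an illustration only, not the library's implementation: it assumes the fitted spline has already been sampled into a list of (x, y) points, and it approximates the orthogonal distance as the shortest distance to any segment of that polyline. The names distance_to_segment and distances_to_curve are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def distance_to_segment(p: Point, a: Point, b: Point) -> float:
    """Distance from p to the closest point on the segment a-b."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: both endpoints coincide.
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distances_to_curve(
    xs: Sequence[float], ys: Sequence[float], curve: Sequence[Point]
) -> List[float]:
    """Shortest distance of each (x, y) point to a sampled curve.

    `curve` is assumed to hold at least two sampled points of the spline.
    """
    segments = list(zip(curve[:-1], curve[1:]))
    return [
        min(distance_to_segment((x, y), a, b) for a, b in segments)
        for x, y in zip(xs, ys)
    ]


# Hypothetical sampled curve: a flat line from x=0 to x=2.
curve = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
result = distances_to_curve([0.5, 3.0], [2.0, 0.0], curve)
```

The denser the sampling of the spline, the closer this polyline approximation gets to the true orthogonal distance.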
- calc_distance_point(point: Tuple[float, float], p1: Tuple[float, float], p2: Tuple[float, float]) → float
This function calculates the orthogonal distance of a point from the line defined by two other points.
- Parameters:
point (Tuple[float, float]) – The point for which the distance is to be calculated.
p1 (Tuple[float, float]) – The first point defining the line.
p2 (Tuple[float, float]) – The second point defining the line.
- Returns:
The orthogonal distance of the point from the line.
- Return type:
float
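For reference, the orthogonal distance of a point from the line through two other points is the absolute value of the 2D cross product divided by the length of the line's direction vector. The block below is a minimal standalone sketch of that formula. The function name orthogonal_distance and the handling of the degenerate case p1 == p2 are assumptions, not taken from the library.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def orthogonal_distance(point: Point, p1: Point, p2: Point) -> float:
    """Perpendicular distance from `point` to the infinite line through p1 and p2."""
    x0, y0 = point
    x1, y1 = p1
    x2, y2 = p2
    dx, dy = x2 - x1, y2 - y1
    length = math.hypot(dx, dy)
    if length == 0.0:
        # Assumption: if p1 == p2 there is no line, so fall back to the
        # plain point-to-point distance.
        return math.hypot(x0 - x1, y0 - y1)
    # |cross((p2 - p1), (point - p1))| / |p2 - p1|
    return abs(dx * (y0 - y1) - dy * (x0 - x1)) / length
```

For example, the point (0, 1) lies at distance 1 from the x-axis defined by (0, 0) and (1, 0).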
- classify_exp_level(df, filtered, col) → Tuple[DataFrame, str]
This function classifies the expression level of genes into five categories: very low, low, middle, high, very high. It also generates a summary of the quantile thresholds and the number of genes in each category.
- Parameters:
df (pandas.DataFrame) – The input DataFrame containing the gene expression data.
filtered (list or numpy.ndarray) – The filtered gene expression data for which to calculate the thresholds.
col (str) – The column in df containing the gene expression data.
- Returns:
The DataFrame with an additional column for the expression level category, and the summary string.
- Return type:
tuple(pandas.DataFrame, str)
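The five-way classification can be illustrated with four ascending cut points and a binary search. This is a sketch under stated assumptions rather than the library's code: the function name classify_levels, the example cut points, and the rule that a value equal to a cut point falls into the higher category are all hypothetical.

```python
from bisect import bisect_right
from collections import Counter
from typing import List, Sequence

LABELS = ["very low", "low", "middle", "high", "very high"]


def classify_levels(values: Sequence[float], cuts: Sequence[float]) -> List[str]:
    """Map each value to one of five labels using four ascending cut points."""
    cuts = list(cuts)
    if len(cuts) != 4 or cuts != sorted(cuts):
        raise ValueError("expected four ascending cut points")
    # bisect_right counts how many cut points are <= v, which is the label index.
    return [LABELS[bisect_right(cuts, v)] for v in values]


# Hypothetical expression values and cut points.
levels = classify_levels([0.5, 1.5, 2.5, 3.5, 9.0], [1.0, 2.0, 3.0, 4.0])
summary = Counter(levels)  # number of values per category
```

A per-category count like summary is the kind of information the returned summary string reports alongside the thresholds.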
- clean_data(df: DataFrame, col: str, threshold=99.75) → ndarray
This function cleans the data by removing outliers.
- Parameters:
df (pandas.DataFrame) – The DataFrame containing the data to be cleaned.
col (str) – The column in the DataFrame to be cleaned.
threshold (float, optional) – The percentile above which data points are considered outliers. Default is 99.75.
- Returns:
The cleaned data as a 1D numpy array.
- Return type:
numpy.ndarray
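Percentile-based outlier removal of this kind can be sketched in plain Python. The block below is illustrative only: it works on a list rather than a DataFrame column, uses linear interpolation between order statistics for the percentile, and keeps values equal to the cutoff. All three choices, and the names percentile and drop_upper_outliers, are assumptions rather than documented behavior.

```python
import math
from typing import List, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0-100), linearly interpolated between order statistics."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def drop_upper_outliers(values: Sequence[float], threshold: float = 99.75) -> List[float]:
    """Keep only values at or below the given percentile."""
    cutoff = percentile(values, threshold)
    return [v for v in values if v <= cutoff]


# Hypothetical data: the values 1..1000.
data = [float(i) for i in range(1, 1001)]
cleaned = drop_upper_outliers(data)
```

With 1000 evenly spaced values, the default threshold removes only the three largest.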
- detect_outliers(cell_type: str = None, outlier_threshold: float = 0.15) → DataFrame
This function detects outliers in the expression versus detection data of a given cell type. It calculates the orthogonal distance of each point from the spline fitted to the data, and marks points with a distance greater than the specified outlier threshold as outliers.
- Parameters:
cell_type (str, optional) – The specific cell type to analyze. If None, all cell types are used. Default is None.
outlier_threshold (float, optional) – The threshold for marking a point as an outlier. If the orthogonal distance of a point from the spline is greater than this threshold, it is marked as an outlier. Default is 0.15.
adata (anndata.AnnData, optional) – The input AnnData object containing the gene expression data. Default is the global adata object.
- Returns:
A DataFrame with the mean expression, percentage detection, orthogonal distance from the spline, and outlier status of each gene for the given cell type.
- Return type:
pandas.DataFrame
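The outlier rule itself is a strict comparison against the threshold. A minimal sketch follows, assuming the orthogonal distances have already been computed and are stored per gene. The function name flag_outliers and the gene names are hypothetical.

```python
from typing import Dict


def flag_outliers(
    distances: Dict[str, float], outlier_threshold: float = 0.15
) -> Dict[str, bool]:
    """Mark a gene as an outlier if its distance is greater than the threshold."""
    return {gene: dist > outlier_threshold for gene, dist in distances.items()}


# Hypothetical distances from the spline for three genes.
flags = flag_outliers({"geneA": 0.05, "geneB": 0.40, "geneC": 0.15})
```

Because the comparison is "greater than", a gene sitting exactly at the threshold (geneC here) is not flagged.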
- dotplot(GOI: str, cell_type: str = None, **kwargs) → None
This function creates a dot plot of the expression of a given gene of interest (GOI) across all cell types or a specific cell type.
- Parameters:
GOI (str) – The gene of interest.
cell_type (str, optional) – The specific cell type to plot. If None, all cell types are plotted. Default is None.
adata (anndata.AnnData, optional) – The input AnnData object containing the gene expression data. Default is the global adata object.
kwargs – Additional keyword arguments to be passed to the sc.pl.dotplot function.
- Returns:
None
- explain_expr_celltypes(GOI: str, col: str = 'log1p(means)', layer='log_norm') → DataFrame
This function explains the expression of cell types for a given gene of interest (GOI). It creates a DataFrame for each cell type, calculates the expression level of the GOI, and concatenates the results. The resulting DataFrame is then formatted for output.
- Parameters:
GOI (str) – The gene of interest.
adata (anndata.AnnData) – The input AnnData object containing the gene expression data.
col (str, optional) – The column in the DataFrame to be used for the expression level calculation. Default is ‘log1p(means)’.
layer (str, optional) – The layer of the AnnData object to be used for calculating highly variable genes. Default is ‘log_norm’.
- Returns:
The DataFrame with the expression level of the GOI for each cell type.
- Return type:
pandas.DataFrame
- expression_vs_detection(GOI: str, adata=None, cell_type: str = None, col: str = 'log1p(means)', return_df: bool = False) → None | DataFrame
This function creates a DataFrame, calculates the percentage of cells where each gene is detected, and optionally returns the DataFrame. If return_df=False, the function plots the mean expression of a given gene of interest (GOI) versus the percentage of cells where this gene is detected, either across all cell types or a specific cell type.
- Parameters:
GOI (str) – The gene of interest.
adata (anndata.AnnData, optional) – The input AnnData object containing the gene expression data. Default is the global adata object.
cell_type (str, optional) – The specific cell type to plot. If None, all cell types are plotted. Default is None.
layer (str, optional) – The layer of the AnnData object to be used for calculating highly variable genes. Default is ‘log_norm’.
col (str, optional) – The column in the DataFrame to be used for the x-axis. Default is ‘log1p(means)’.
return_df (bool, optional) – Whether to return the DataFrame. If False, the function will plot the data and return None. Default is False.
- Returns:
If return_df is True, the DataFrame with the mean expression and percentage detection of the GOI for each cell type. Otherwise, None.
- Return type:
Union[None, pandas.DataFrame]
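The two quantities compared here, mean expression and percentage of cells with detection, can be computed per gene from a cells-by-genes matrix. The block below is a sketch only. It assumes that 'log1p(means)' denotes log1p of the per-gene mean and that "detected" means a value greater than zero. The function name mean_and_detection and the key pct_detected are hypothetical.

```python
import math
from typing import Dict, Sequence


def mean_and_detection(
    counts: Sequence[Sequence[float]], genes: Sequence[str]
) -> Dict[str, Dict[str, float]]:
    """Per-gene log1p(mean expression) and percentage of cells with expression > 0.

    `counts` is a cells x genes matrix given as nested sequences.
    """
    n_cells = len(counts)
    if n_cells == 0:
        raise ValueError("counts must contain at least one cell")
    table = {}
    for j, gene in enumerate(genes):
        column = [row[j] for row in counts]
        detected = sum(1 for value in column if value > 0)
        table[gene] = {
            "log1p(means)": math.log1p(sum(column) / n_cells),
            "pct_detected": 100.0 * detected / n_cells,
        }
    return table


# Hypothetical matrix: four cells, two genes.
table = mean_and_detection([[0, 2], [0, 4], [3, 0], [1, 0]], ["geneA", "geneB"])
```

Plotting one of these columns against the other gives the expression-versus-detection scatter that the spline is fitted to.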
- find_thresholds(filtered)
This function calculates and returns four thresholds (very low, low, middle, high) based on the range of the input data. The very_high threshold is set to the 99.75th percentile of the input data.
- Parameters:
filtered (list or numpy.ndarray) – The input data for which to calculate the thresholds.
- Returns:
The calculated thresholds (very low, low, middle, high).
- Return type:
tuple
- fit_spline(cell_type: str = None, plot: bool = False) → None | Tuple[ndarray, ndarray]
This function fits a spline to the expression versus detection data of a given cell type, and optionally plots the data and the fitted spline.
- Parameters:
cell_type (str, optional) – The specific cell type to plot. If None, all cell types are plotted. Default is None.
adata (anndata.AnnData, optional) – The input AnnData object containing the gene expression data. Default is the global adata object.
plot (bool, optional) – Whether to plot the data and the fitted spline. If False, the function will return the spline parameters and inflection points. Default is False.
- Returns:
If plot is False, a tuple containing the spline parameters and the inflection points. Otherwise, None.
- Return type:
Union[None, Tuple[numpy.ndarray, numpy.ndarray]]
- get_adata() → AnnData
This function returns the AnnData object loaded in this module.
- Returns:
The AnnData object loaded in this module.
- Return type:
anndata.AnnData
- heatmap(GOI: str, layer: str = None) → None
This function creates a heatmap of the expression of a given gene of interest (GOI) across all cell types.
- Parameters:
GOI (str) – The gene of interest.
layer (str, optional) – The layer of the AnnData object to be used for calculating highly variable genes. Default is None.
adata (anndata.AnnData, optional) – The input AnnData object containing the gene expression data. Default is the global adata object.
- Returns:
None
- list_outliers(cell_type: str = None, head: int = 5) → DataFrame
This function lists the top outliers in the expression versus detection data for a given cell type. It sorts the data by the orthogonal distance from the spline, in descending order, and returns the top ‘head’ number of outliers.
- Parameters:
cell_type (str, optional) – The specific cell type to analyze. If None, all cell types are used. Default is None.
head (int, optional) – The number of top outliers to return. Default is 5.
- Returns:
A DataFrame with the top ‘head’ number of outliers, sorted by the orthogonal distance from the spline in descending order.
- Return type:
pandas.DataFrame
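The ranking step amounts to a descending sort on the distance followed by taking the first head entries. A minimal sketch, assuming a plain mapping from gene name to distance instead of a DataFrame. The name top_outliers and the gene names are hypothetical, and this sketch ranks all genes rather than only those already flagged as outliers.

```python
from typing import Dict, List, Tuple


def top_outliers(distances: Dict[str, float], head: int = 5) -> List[Tuple[str, float]]:
    """Return the `head` genes with the largest distance, in descending order."""
    ranked = sorted(distances.items(), key=lambda item: item[1], reverse=True)
    return ranked[:head]


# Hypothetical distances from the spline for three genes.
top = top_outliers({"geneA": 0.05, "geneB": 0.40, "geneC": 0.20}, head=2)
```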
- make_df(adata, threshold: float = 99.75, col: str = 'log1p(means)') → Tuple[DataFrame, str]
This function creates a DataFrame from the given AnnData object, where the genes’ expression level is calculated and classified into five categories: very low, low, middle, high, very high. It also generates a summary of the quantile thresholds and the number of genes in each category.
- Parameters:
adata (anndata.AnnData) – The input AnnData object containing the gene expression data.
threshold (float, optional) – The percentile above which data points are considered outliers (i.e. very_high class). Default is 99.75.
col (str, optional) – The column in the DataFrame to be used for the expression level classification. Default is ‘log1p(means)’.
layer (str, optional) – The layer of the AnnData object to be used for calculating highly variable genes. Default is ‘log_norm’.
- Returns:
The DataFrame with additional columns for the gene number and expression level category, and the summary string.
- Return type:
tuple(pandas.DataFrame, str)
- matrixplot(GOI: str, cell_type: str = None) → None
This function creates a matrix plot of the expression of a given gene of interest (GOI) across all cell types or a specific cell type.
- Parameters:
GOI (str) – The gene of interest.
cell_type (str, optional) – The specific cell type to plot. If None, all cell types are plotted. Default is None.
adata (anndata.AnnData, optional) – The input AnnData object containing the gene expression data. Default is the global adata object.
- Returns:
None
- plot_expr_class(GOI: str, ax, adata, cell_type: str = None, col: str = 'log1p(means)') → Tuple[DataFrame, str]
This function plots the expression class of a given gene of interest (GOI) across all cell types or a specific cell type. It creates a DataFrame, highlights the GOI on the plot, and annotates it with its expression class.
- Parameters:
GOI (str) – The gene of interest.
ax (matplotlib.axes.Axes) – The axes object to draw the plot onto.
adata (anndata.AnnData) – The input AnnData object containing the gene expression data.
cell_type (str, optional) – The specific cell type to plot. If None, all cell types are plotted. Default is None.
col (str, optional) – The column in the DataFrame to be used for the y-axis. Default is ‘log1p(means)’.
- Returns:
The DataFrame with the expression level of the GOI for each cell type, and the summary string.
- Return type:
tuple(pandas.DataFrame, str)
- plot_expressions(GOI: str, cell_type: str = 'CD4 T', show_summary: bool = False) → None
This function plots the expression class of a given gene of interest (GOI) across all cell types and a specific cell type. It creates two subplots, one for all cell types and one for the specific cell type, and optionally prints a summary of the expression class for each plot.
- Parameters:
GOI (str) – The gene of interest.
cell_type (str, optional) – The specific cell type to plot. Default is ‘CD4 T’.
show_summary (bool, optional) – Whether to print a summary of the expression class for each plot. Default is False.
adata (anndata.AnnData, optional) – The input AnnData object containing the gene expression data. Default is the global adata object.
- Returns:
None
- plot_outliers(GOI: str, cell_type: str = None, outlier_threshold: float = 0.15) → None
This function plots the expression versus detection data for a given cell type, with outliers highlighted. It also highlights a gene of interest (GOI) in the plot.
- Parameters:
GOI (str) – The gene of interest to be highlighted in the plot.
cell_type (str, optional) – The specific cell type to plot. If None, all cell types are plotted. Default is None.
outlier_threshold (float, optional) – The threshold for marking a point as an outlier. If the orthogonal distance of a point from the spline is greater than this threshold, it is marked as an outlier. Default is 0.15.
- Returns:
None
- plot_umap(color: str = 'celltype_l2') → None
This function plots a UMAP of the given AnnData object, grouping by coarse cell type in the adata object.
- Parameters:
color (str, optional) – The annotation key used to color the plot. Default is ‘celltype_l2’.