Gene Regulatory Network Analysis

GRNPipeline is a pipeline for analyzing and visualizing the gene regulatory network around a gene of interest (GOI). It is designed to work with the output of the SCENIC package, specifically the adjacencies and regulons files. The SCENIC analysis is already implemented, and the pipeline processes its output automatically. Additional gene sets from the Reactome database are also included.

class grn_pipeline.GRNPipeline(wdir, adata, f_adj, f_reg, dir_gg_adj, gg_adj_files)
GOI_network_stats(df: DataFrame, GOI: str) → None

This function prints a summary of a given gene of interest (GOI) and its regulons.

Parameters:
  • df (pandas.DataFrame) – The DataFrame containing the gene data.

  • GOI (str) – The gene of interest.

__init__(wdir, adata, f_adj, f_reg, dir_gg_adj, gg_adj_files)

Initialize the GRNPipeline class. This function sets the working directory, loads the AnnData object, reads the adjacency and regulon data, builds the Reactome gene sets, the regulon gene sets, and the combined gene sets, and reads the gene-gene adjacency files for the different cell types.

Parameters:
  • wdir (str) – The working directory.

  • adata (AnnData) – The AnnData object.

  • f_adj (str) – The adjacency file.

  • f_reg (str) – The regulon file.

  • dir_gg_adj (str) – The directory containing the gene-gene adjacency files.

  • gg_adj_files (list of str) – The gene-gene adjacency files.

find_TFs(df: DataFrame, GOI: str) → array

This function finds the transcription factors (TFs) that regulate a given gene of interest (GOI).

Parameters:
  • df (pandas.DataFrame) – The DataFrame containing the regulon data.

  • GOI (str) – The gene of interest.

Returns:

An array of TFs that regulate the GOI.

Return type:

numpy.ndarray
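
The lookup described above can be sketched with plain pandas. This is a minimal illustration, not the pipeline's own code: the column names TF, target, and importance follow the usual SCENIC adjacency layout and are an assumption here, the helper name find_tfs_sketch is hypothetical, and the toy values are made up.

```python
import numpy as np
import pandas as pd


def find_tfs_sketch(df: pd.DataFrame, goi: str) -> np.ndarray:
    """Return the unique TFs that have `goi` among their targets."""
    # Assumed layout: one row per TF-target edge.
    is_goi_target = df["target"] == goi
    # drop_duplicates keeps first-appearance order; to_numpy gives an ndarray.
    return df.loc[is_goi_target, "TF"].drop_duplicates().to_numpy()


# Toy regulon table (made-up values, for illustration only).
toy_regulons = pd.DataFrame(
    {
        "TF": ["STAT1", "STAT1", "IRF1", "MYC"],
        "target": ["CXCL10", "GBP1", "CXCL10", "NCL"],
        "importance": [12.3, 8.1, 5.4, 20.0],
    }
)

tfs = find_tfs_sketch(toy_regulons, "CXCL10")
```

With the toy table, tfs holds the two TFs that list CXCL10 as a target, in order of first appearance.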

format_gene_summary(df: DataFrame, GOI: str) → None

This function formats and prints the gene summary from NCBI in a readable way.

Parameters:
  • df (pandas.DataFrame) – The DataFrame containing the gene data.

  • GOI (str) – The gene of interest.

gGOSt(regulon: str) → DataFrame

This function runs a g:Profiler g:GOSt pathway analysis for a given regulon, plots the results, and returns the result DataFrame. Currently, only the REAC, KEGG, and GO:BP databases are searched. Some regulons may have no significant pathways, in which case the function returns a message.

Parameters:

regulon (str) – The name of the regulon.

Returns:

A DataFrame with the g:Profiler g:GOSt analysis results for the given regulon.

Return type:

pandas.DataFrame

gGOSt_listed(GOI: str) → str

Print the top three significantly enriched pathways among genes co-expressed with the gene of interest (GOI), per cell lineage. This function uses the g:Profiler API to find Reactome and KEGG pathways that are significantly enriched among the genes co-expressed with the GOI, and prints the top three pathways for each cell lineage.

Parameters:

GOI (str) – The gene of interest.

Returns:

A string message indicating the completion of the function.

Return type:

str

genegene_importance_histograms(log_scale=False, xlim=10)

Generate histograms of gene-gene adjacency importance scores for different cell types: B Cells, Epithelium Cells, Myeloid Cells, Stroma Cells, and T Cells. Each histogram shows the distribution of importance scores for one cell type.

Parameters:
  • log_scale (bool) – Whether to use a logarithmic scale for the y-axis. Defaults to False.

  • xlim (int) – The upper limit of the x-axis. Defaults to 10.

get_entrez_gene_summary(gene_name: str, email: str, organism: str = 'human', max_gene_ids: int = 10) → dict

Returns the ‘Summary’ contents for the provided gene from the Entrez Gene database.

Parameters:
  • gene_name (str) – Official (HGNC) gene name (e.g., ‘KAT2A’)

  • email (str) – Required email for making requests

  • organism (str, optional) – Restricts results to the given organism. Set to None to return results for all organisms. Default is ‘human’.

  • max_gene_ids (int, optional) – Sets the maximum number of Gene ID results to return (the absolute maximum allowed is 10,000). Default is 10.

Returns:

Summaries for all gene IDs associated with gene_name (where: keys → [orgn][gene name], values → gene summary)

Return type:

dict

get_gene_summary(df: DataFrame, GOI: str, email: str = 'samantha.bening@helmholtz-munich.de') → dict

This function gets the gene summary from NCBI for a given gene of interest (GOI) and its transcription factors (TFs).

Parameters:
  • df (pandas.DataFrame) – The DataFrame containing the gene data.

  • GOI (str) – The gene of interest.

  • email (str, optional) – The email to use for making requests to NCBI. Default is ‘samantha.bening@helmholtz-munich.de’.

Returns:

A dictionary with the gene summaries, where the keys are the gene names and the values are the summaries.

Return type:

dict

get_genesets() → DataFrame

This function concatenates the reactome and regulon dataframes, resets the index, and returns the result.

Parameters:
  • reactome (pandas.DataFrame, optional) – The DataFrame containing the reactome data. Default is reactome.

  • reg_df (pandas.DataFrame, optional) – The DataFrame containing the regulon data. Default is regulon_geneset.

Returns:

A DataFrame with the concatenated reactome and regulon data.

Return type:

pandas.DataFrame
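
The concatenate-and-reset step can be sketched as follows. The two toy tables stand in for the reactome and regulon gene set tables; their column names (geneset, genesymbol) and contents are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for the two gene set tables (made-up contents).
reactome = pd.DataFrame(
    {"geneset": ["TOY_PATHWAY", "TOY_PATHWAY"], "genesymbol": ["TP53", "MDM2"]}
)
regulon_geneset = pd.DataFrame(
    {"geneset": ["STAT1_regulon", "STAT1_regulon"], "genesymbol": ["CXCL10", "GBP1"]}
)

# Stack the tables, then replace the repeated 0..n-1 indices with one running index.
geneset_df = pd.concat([reactome, regulon_geneset]).reset_index(drop=True)
```

Without the reset, both halves would keep their own 0 and 1 row labels, so label-based lookups would return two rows.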

get_goi_pathways(GOI, method='spearman')

This function calculates and ranks the correlation between each pathway’s AUCell score and the expression of the gene of interest (GOI).

Parameters:
  • GOI (str) – The gene of interest.

  • geneset_df (pandas.DataFrame, optional) – The DataFrame containing the geneset data. Default is geneset_df.

  • adata (anndata.AnnData, optional) – The AnnData object containing the single-cell data. Default is adata.

  • method (str, optional) – The method to use for calculating correlation. Default is ‘spearman’.

Returns:

A DataFrame with the pathways for the GOI, sorted by absolute correlation value.

Return type:

pandas.DataFrame
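
The ranking can be sketched as below, assuming the AUCell scores are available as a cells-by-pathways DataFrame and the GOI expression as a Series over the same cells. Both inputs, the helper name, and the toy values are assumptions. Spearman's rho is computed here as the Pearson correlation of the ranks, which is equivalent and needs only pandas.

```python
import pandas as pd


def rank_pathways_by_correlation(scores: pd.DataFrame, goi_expr: pd.Series) -> pd.DataFrame:
    """Correlate each pathway score column with GOI expression (Spearman)."""
    # Spearman's rho equals the Pearson correlation of the average ranks.
    rho = scores.rank().corrwith(goi_expr.rank())
    out = rho.rename("correlation").rename_axis("pathway").reset_index()
    # Order by strength of association, regardless of sign.
    order = out["correlation"].abs().sort_values(ascending=False).index
    return out.loc[order].reset_index(drop=True)


# Toy data: five cells, three pathway scores (made-up values).
goi_expr = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0])
scores = pd.DataFrame(
    {
        "pathway_mixed": [0.5, 0.1, 0.4, 0.2, 0.3],  # weak relationship
        "pathway_up": [0.1, 0.2, 0.4, 0.8, 1.6],     # rises with the GOI
        "pathway_down": [5.0, 4.0, 1.0, 3.5, 0.5],   # mostly falls with the GOI
    }
)
ranked = rank_pathways_by_correlation(scores, goi_expr)
```

Sorting on the absolute value keeps strongly anti-correlated pathways near the top, which matches the "sorted by absolute correlation value" behavior described for the return value.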

get_reactome() → DataFrame

This function retrieves the reactome data, filters it based on geneset size, and returns it as a DataFrame.

Returns:

A DataFrame with the filtered reactome data.

Return type:

pandas.DataFrame
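
A size filter of this kind might look like the sketch below. The size bounds, the column names, and the helper name are assumptions; the thresholds the pipeline actually uses are not stated here.

```python
import pandas as pd


def filter_by_geneset_size(genesets: pd.DataFrame, min_size: int, max_size: int) -> pd.DataFrame:
    """Keep only gene sets whose number of genes lies in [min_size, max_size]."""
    # Count rows per gene set and broadcast the count back to every row
    # (assumes one row per gene set and gene pair).
    sizes = genesets.groupby("geneset")["genesymbol"].transform("count")
    keep = (sizes >= min_size) & (sizes <= max_size)
    return genesets[keep].reset_index(drop=True)


# Toy table: set A has 1 gene, set B has 3, set C has 5 (made-up contents).
toy_sets = pd.DataFrame(
    {
        "geneset": ["A"] + ["B"] * 3 + ["C"] * 5,
        "genesymbol": ["g1", "g1", "g2", "g3", "g1", "g2", "g3", "g4", "g5"],
    }
)
filtered = filter_by_geneset_size(toy_sets, min_size=2, max_size=4)
```

Only set B survives the hypothetical bounds of 2 to 4 genes.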

get_regulon_genes(reg_df: DataFrame, TF: str) → DataFrame

Returns a DataFrame with all target genes and importance scores for a given transcription factor (TF).

Parameters:
  • reg_df (pandas.DataFrame) – The DataFrame containing the regulon data.

  • TF (str) – The name of the transcription factor.

Returns:

A DataFrame with all target genes and importance scores for the given TF.

Return type:

pandas.DataFrame

get_regulon_geneset(regulon=None) → List[str]

This function returns the gene set of a specific regulon. If no regulon is specified, it defaults to the first regulon in the list.

Parameters:
  • df (pandas.DataFrame, optional) – The DataFrame containing the geneset data. Default is geneset_df.

  • regulon (str, optional) – The name of the regulon. If None, the first regulon in the list is used. Default is None.

Returns:

A list of genes in the specified regulon.

Return type:

List[str]

get_regulon_genesets() → DataFrame

Returns a DataFrame with all genesets for each unique transcription factor (TF) in the given regulon data.

Parameters:

reg_df (pandas.DataFrame) – The DataFrame containing the regulon data.

Returns:

A DataFrame with all genesets for each unique TF.

Return type:

pandas.DataFrame

static gmt_to_decoupler(pth: Path) → DataFrame

Parse a gmt file to a decoupler pathway dataframe.

Parameters:

pth (Path) – The path to the gmt file.

Returns:

A DataFrame with the geneset and genesymbol from the gmt file.

Return type:

pandas.DataFrame
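
For orientation, a GMT file holds one gene set per line: the set name, a description, and then the member genes, all tab-separated. A minimal parser into the long geneset/genesymbol layout could look like this. The helper name gmt_to_long_frame and the toy file are hypothetical, and this is a sketch of the idea rather than the method's actual body.

```python
import tempfile
from pathlib import Path

import pandas as pd


def gmt_to_long_frame(pth: Path) -> pd.DataFrame:
    """Parse a GMT file into one row per (geneset, genesymbol) pair."""
    records = []
    with Path(pth).open() as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            name, _description, *genes = fields
            records.extend((name, gene) for gene in genes if gene)
    return pd.DataFrame(records, columns=["geneset", "genesymbol"])


# Write a two-line toy GMT file and parse it (made-up gene sets).
with tempfile.TemporaryDirectory() as tmp:
    gmt_path = Path(tmp) / "toy.gmt"
    gmt_path.write_text("SET_A\tdesc\tTP53\tMDM2\nSET_B\tdesc\tMYC\n")
    long_df = gmt_to_long_frame(gmt_path)
```

The long layout has one row per gene, so a set with two genes contributes two rows.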

make_adj_df(adj_df: DataFrame, GOI: str) → DataFrame

This function creates a DataFrame for a given gene of interest (GOI) with its adjacencies sorted by importance.

Parameters:
  • adj_df (pandas.DataFrame) – The DataFrame containing the adjacency data.

  • GOI (str) – The gene of interest.

Returns:

A DataFrame with the adjacencies of the GOI, sorted by importance, and a group column set to ‘adjacencies’.

Return type:

pandas.DataFrame
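
A filter-sort-label step like the one described can be sketched as below. Whether the method matches the GOI on the TF side, the target side, or both is not stated, so matching either end is an assumption, as are the column names, the helper name, and the toy values.

```python
import pandas as pd


def adjacencies_for_goi(adj_df: pd.DataFrame, goi: str) -> pd.DataFrame:
    """Select the GOI's edges, strongest first, and tag them with a group label."""
    # Assumption: an edge belongs to the GOI if the GOI sits on either end.
    mask = (adj_df["TF"] == goi) | (adj_df["target"] == goi)
    out = adj_df[mask].sort_values("importance", ascending=False)
    return out.assign(group="adjacencies").reset_index(drop=True)


# Toy adjacency table (made-up values).
toy_adj = pd.DataFrame(
    {
        "TF": ["IRF1", "MYC", "STAT1", "CXCL10"],
        "target": ["CXCL10", "NCL", "CXCL10", "GBP1"],
        "importance": [5.4, 20.0, 12.3, 1.5],
    }
)
goi_adj = adjacencies_for_goi(toy_adj, "CXCL10")
```

The constant group column makes it easy to tell these rows apart once they are combined with regulon rows in one table.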

make_gene_gene_network(GOI: str, top_n: int = None, out_file: str = 'src/SCENICfiles/network') → str

This function creates a network visualization of a given gene of interest (GOI) and its co-expressed genes.

Parameters:
  • df (pandas.DataFrame) – The DataFrame containing the gene data.

  • GOI (str) – The gene of interest.

  • top_n (int, optional) – The number of top neighbors of each regulon (other than GOI) to include. If None, no regulon neighbors (other than GOI) are included. Default is None.

  • out_file (str, optional) – The path to the output file where the network visualization will be saved. Default is ‘src/SCENICfiles/network’.

make_goi_grn(GOI: str) → DataFrame

This function creates a gene regulatory network (GRN) for a given gene of interest (GOI).

Parameters:
  • GOI (str) – The gene of interest.

  • df (pandas.DataFrame, optional) – The DataFrame containing the regulon data. Default is reg_df.

Returns:

A DataFrame representing the GRN of the GOI, sorted by importance and only including target genes that appear more than once.

Return type:

pandas.DataFrame
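
One way to read the description is: collect the regulons of every TF that targets the GOI, keep targets that occur in more than one edge, and sort by importance. The sketch below follows that reading. It is an interpretation with assumed column names and made-up toy values, not the method's actual implementation.

```python
import pandas as pd


def goi_grn_sketch(reg_df: pd.DataFrame, goi: str) -> pd.DataFrame:
    """Edges of the GOI's TFs, restricted to targets that appear more than once."""
    # TFs with an edge into the GOI.
    tfs = reg_df.loc[reg_df["target"] == goi, "TF"].unique()
    # All edges leaving those TFs.
    edges = reg_df[reg_df["TF"].isin(tfs)]
    # With one row per TF-target pair, a repeated target is shared by several TFs.
    shared = edges[edges.duplicated(subset="target", keep=False)]
    return shared.sort_values("importance", ascending=False).reset_index(drop=True)


# Toy regulon table (made-up values).
toy_reg = pd.DataFrame(
    {
        "TF": ["STAT1", "STAT1", "IRF1", "IRF1", "IRF1", "MYC"],
        "target": ["CXCL10", "GBP1", "CXCL10", "GBP1", "PSMB9", "NCL"],
        "importance": [12.3, 8.1, 5.4, 3.0, 2.0, 20.0],
    }
)
grn = goi_grn_sketch(toy_reg, "CXCL10")
```

In the toy data, PSMB9 is dropped because only one TF targets it, and MYC is dropped because it does not regulate the GOI.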

make_network(df: DataFrame, GOI: str, direct_TF: bool = True, top_n: int = None, out_file: str = 'src/SCENICfiles/network') → str

This function creates a network visualization of a given gene of interest (GOI) and its regulons.

Parameters:
  • df (pandas.DataFrame) – The DataFrame containing the gene data.

  • GOI (str) – The gene of interest.

  • direct_TF (bool, optional) – If True, only direct neighbors of the GOI are included. If False, all neighbors are included. Default is True.

  • top_n (int, optional) – The number of top neighbors of each regulon (other than GOI) to include. If None, no regulon neighbors (other than GOI) are included. Default is None.

  • out_file (str, optional) – The path to the output file where the network visualization will be saved. Default is ‘src/SCENICfiles/network’.
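
The effect of top_n can be illustrated on an edge table: for each regulon, keep the top_n strongest targets other than the GOI, or none when top_n is None. The helper below is a hypothetical sketch with assumed column names and made-up values. It shows only the edge selection, not the rendering of the network.

```python
from typing import Optional

import pandas as pd


def top_neighbors_per_regulon(
    reg_df: pd.DataFrame, goi: str, top_n: Optional[int]
) -> pd.DataFrame:
    """Select up to `top_n` non-GOI targets per TF, ranked by importance."""
    others = reg_df[reg_df["target"] != goi]
    if top_n is None:
        return others.iloc[0:0]  # empty frame with the same columns
    ranked = others.sort_values("importance", ascending=False)
    # head(top_n) keeps the first top_n rows of each TF group in ranked order.
    return ranked.groupby("TF", sort=False).head(top_n).reset_index(drop=True)


# Toy regulon table (made-up values).
toy_edges = pd.DataFrame(
    {
        "TF": ["STAT1", "STAT1", "STAT1", "STAT1", "IRF1", "IRF1", "IRF1"],
        "target": ["CXCL10", "GBP1", "IRF9", "PSMB9", "CXCL10", "GBP1", "TAP1"],
        "importance": [12.3, 8.1, 6.0, 1.0, 5.4, 3.0, 2.0],
    }
)
top2 = top_neighbors_per_regulon(toy_edges, "CXCL10", top_n=2)
```

With top_n=2, each TF contributes its two strongest non-GOI targets, so PSMB9 is left out of the STAT1 regulon.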

make_regulon_dataframe(df: DataFrame, TF: str) → DataFrame

This function creates a DataFrame for a given transcription factor (TF) with its target genes and their importance.

Parameters:
  • df (pandas.DataFrame) – The DataFrame containing the regulon data.

  • TF (str) – The transcription factor.

Returns:

A DataFrame with the target genes of the TF, their importance, the TF, and the group (TF_regulon).

Return type:

pandas.DataFrame

plot_pathways(df: DataFrame, GOI: str) → None

This function plots the UMAP of the top 5 pathways along with the cell type and the gene of interest (GOI).

Parameters:
  • df (pandas.DataFrame) – The DataFrame containing the pathway data.

  • GOI (str) – The gene of interest.

  • adata (anndata.AnnData, optional) – The AnnData object containing the single-cell data. Default is adata.

plot_regulon_expression(df: DataFrame, GOI: str) → None

This function plots the expression of a given gene of interest (GOI) and its regulons.

Parameters:
  • df (pandas.DataFrame) – The DataFrame containing the gene data.

  • GOI (str) – The gene of interest.

  • adata (anndata.AnnData, optional) – The AnnData object containing the single-cell data. Default is adata.

show_network(GOI: str, type: str = 'gene_gene', top_n: int = 5) → None

Display a saved network visualization. This function reads the HTML file generated by either the make_network or the make_gene_gene_network method, depending on the type parameter, and displays its content.

Parameters:
  • GOI (str) – The gene of interest.

  • type (str) – The type of network to display, either ‘regulon’ or ‘gene_gene’. Defaults to ‘gene_gene’.

  • top_n (int) – The number of top genes to include in the network. Defaults to 5.