Open access peer-reviewed chapter - ONLINE FIRST

Gene Regulation via RNA Isoform Variations

Written By

Bin Zhang and Chencheng Xu

Submitted: 29 February 2024 Reviewed: 29 February 2024 Published: 24 May 2024

DOI: 10.5772/intechopen.1005044

Beyond the Blueprint - Decoding the Elegance of Gene Expression IntechOpen
Beyond the Blueprint - Decoding the Elegance of Gene Expression Edited by Morteza Seifi

From the Edited Volume

Beyond the Blueprint - Decoding the Elegance of Gene Expression [Working Title]

Ph.D. Morteza Seifi

Abstract

The completion of the draft and complete human genome has revealed that there are only around 20,000 genes encoding proteins. Nonetheless, these genes can generate eight times more RNA transcript isoforms, while this number is still growing with the accumulation of high-throughput RNA sequencing (RNA-seq) data. In general, over 90% of genes generate various RNA isoforms emerging from variations at the 5′ and 3′ ends, as well as different exon combinations, known as alternative transcription start site (TSS), alternative polyadenylation (APA), and alternative splicing (AS). In this chapter, our focus will be on introducing the significance of these three types of isoform variations in gene regulation and their underlying molecular mechanisms. Additionally, we will highlight the historical, current, and prospective technological advancements in elucidating isoform regulations, from both the computational side such as deep-learning-based artificial intelligence, and the experimental aspect such as the long-read third-generation sequencing (TGS).

Keywords

  • gene regulation
  • RNA isoform
  • RNA-seq
  • next generation sequencing
  • third generation sequencing
  • transcription start site
  • alternative splicing
  • alternative polyadenylation
  • deep learning

1. Introduction

Since the initial release of the human genome draft in 2001 [1, 2], gaps and unplaced sequences in the genome have been resolved continuously. In 2022, the telomere-to-telomere (T2T) consortium published the first complete sequence of a human genome [3]. With these genome sequences as the reference, genes have been annotated accordingly, and it is now well accepted that there are only around 20,000 protein-coding genes in the human genome. For instance, based on the GENCODE annotation database [4], the number of protein-coding genes (PCGs) has remained almost unchanged over the last decade (Figure 1A). However, the number of annotated RNA transcripts transcribed from these genes has gradually increased (Figure 1B). In general, over 90% of these genes transcribe multiple RNA transcripts, known as isoforms, with variations at the 5′ and 3′ ends as well as different exon combinations (Figure 1C).

Figure 1.

The number of annotated protein-coding genes (A) and transcripts transcribed from these genes (B) in the GENCODE database since 2011. The X-axis in (A) and (B) shows the released versions of GENCODE annotations. Each dot in (A) represents the number of genes in each release, and the segmented colors indicate the release years. Bars in (B) are colored by transcript type, where NMD stands for nonsense-mediated decay. In general, transcripts transcribed from PCGs are classified into four types: protein_coding, processed_transcript, retained_intron, and NMD. The classification is based on their coding potential (protein_coding) and other properties, such as whether they contain unspliced introns (retained_intron) or premature termination codons that trigger mRNA degradation (NMD). Note that the GENCODE transcript type ‘processed_transcript’ was defined as long noncoding RNA (‘lncRNA’) in v31 and v32, and has been redefined as ‘protein_coding_CDS_not_defined’ since v42.

Variations at the 5′ and 3′ ends of isoforms are known as alternative transcription start site (TSS) usage and alternative polyadenylation (APA), respectively, whereas isoform variations formed by different exon combinations are mediated by alternative splicing (AS). The three types of variations occur at different stages of RNA processing. TSS selection happens when transcription is initiated, and RNA splicing is a posttranscriptional or co-transcriptional process [5]. During transcription termination, nascent RNA molecules undergo cleavage and the addition of poly-adenosine (poly-A) tails, known as polyadenylation [6]. APA involves utilizing different polyadenylation sites (PASs) to generate isoforms with distinct 3′ ends. In Section 2, we will introduce the molecular mechanisms underlying these three types of variations.

Variable RNA isoforms not only enable the limited number of genes to generate a much larger number of proteins but also greatly increase the complexity of gene regulation, even when variations only impact the noncoding sequences of RNA. Dysregulated isoform variations contribute significantly to pathogenesis because they impair tightly controlled gene regulation. For instance, approximately 15–50% of human genetic disorders are caused by mutations impairing RNA splicing [7]. In cancer, mutations impairing splicing are also frequent, resulting in widespread dysregulated splicing events. Besides, cancer cells express isoforms utilizing distinct PASs. In addition, pervasive regulation of TSSs through alternative promoters has been observed in tumor samples [8]. These dysregulations cover both the coding sequence (CDS) and the noncoding sequences of RNA, such as untranslated regions (UTRs) and introns. The detailed functional consequences of isoform variations at the molecular and cellular levels will be elucidated in Section 3, and their dysregulation in human disease will be introduced in Section 4.

In the last two decades, next generation sequencing (NGS) has revolutionized RNA profiling for different cell types and for cells under different conditions [9]. While the primary goal of such profiling is to quantify expression at the gene level, many efforts have also focused on isoform variations, mainly alternative splicing, since its quantification is easier than that of TSS and APA. Hence, specialized experimental protocols have been designed to capture TSSs or PASs more effectively. Computational methods have also been developed for detecting them from conventional RNA sequencing (RNA-seq) data, although their performance remains far from satisfactory.

In the last section of this chapter, we will list the influential experimental and computational methods for identifying and quantifying the three types of isoform variations. In addition, with the technological advancements of machine learning, numerous methods and models have been proposed for predicting alternative TSS, splicing, and polyadenylation using DNA/RNA sequences as inputs. Their performance has continuously improved, and many tools have become promising for evaluating the impact of genetic variants on isoform variations and for further predicting disease risks. However, all these methods are limited to variations at the event level and overlook the full-length sequence of isoforms. Consequently, at the end of this chapter, we will highlight the power of third-generation sequencing (TGS) in detecting full-length isoform variations and its potential for training deep-learning models centered on native, intact RNA molecules.

2. Molecular mechanisms underlying RNA isoform variations

Gene regulation usually relies on the binding or recruitment of trans-acting factors to cis-elements residing in DNA or RNA. The regulation of TSS, polyadenylation, and splicing will be introduced separately. At the end of this section, we will highlight the links among these three types of regulation.

2.1 TSS regulation

The TSS is determined by the assembly of a transcription initiation complex, comprising transcriptional machinery such as RNA polymerase and general transcription factors (GTFs), at core promoters [10]. Promoters share various common sequence features, the most well-documented being the TATA-box motif (TATAWAW), first identified in 1978 [11]. Furthermore, the assembly process is regulated by transcription factors (TFs) and their cofactors that bind to enhancers. Recent studies revealed that these factors form condensates, liquid-like membrane-less organelles, to coordinate transcription initiation and elongation [12]. Different TFs have distinct sequence binding preferences in DNA, thereby shaping the varied landscape of activated promoters across different cell types and tissues. In addition, whether a promoter is used for transcribing RNA is also controlled by epigenetic information, such as nucleosome-free positions, DNA methylation, and posttranslational modifications on histones. Overall, the availability of different TFs and heterogeneous promoter epigenetic modifications together regulate alternative TSS utilization (Figure 2A).

Figure 2.

The schematics show the regulation of alternative TSS (A), polyadenylation (B), and splicing (C). (A) TSS is determined by the core promoter, whose activation depends on the binding of transcription factors (TFs) at the upstream enhancer regions, epigenetic markers, and nucleosome positions. (B) Polyadenylation is regulated by a protein complex, including cleavage and polyadenylation specificity factors (CPSF), cleavage stimulation factors (CstF), polyA polymerase, and so on. (C) The regulation of alternative splicing is complex because different splice sites can compete with each other. Moreover, elements within exonic and intronic regions can also enhance (splicing enhancer) or repress (splicing repressor) splicing. The complex regulation results in multiple types of alternative splicing events, such as skipped exon (SE), alternative 5′ and 3′ splice site (A5SS and A3SS), and retained intron (RI).

2.2 Polyadenylation and alternative polyadenylation (APA)

Polyadenylation, in contrast, is mainly regulated by several groups of RNA-binding proteins (RBPs), including cleavage and polyadenylation specificity factors (CPSF), cleavage stimulation factors (CstF), and so on (Figure 2B). During transcription termination, RNA polymerase pauses downstream of the cleavage/polyadenylation site (PAS), after which the CPSF and CstF complexes are recruited by recognizing motifs on the nascent RNA [6]. The hexamer sequence AAUAAA, also called the PAS signal or PAS motif, located 15–40 nt upstream of the cleavage site, is the most important cis-element for polyadenylation [13]. Overall, among the 50-nt sequences upstream of PASs in the human and mouse genomes, more than half contain AATAAA, and 80% of the rest harbor variants such as ATTAAA [14, 15]. Alternative utilization of PASs (APA) is regulated by trans-factors that enhance or inhibit polyadenylation. For instance, the RBP NUDT21 (also known as CFIm25) can repress proximal PAS usage in the 3′ UTR, and its downregulation results in 3′ UTR shortening in glioblastoma [16].
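As a rough illustration of the PAS signal described above, the following minimal sketch scans a transcript-strand sequence for the canonical hexamer and the ATTAAA variant within 15–40 nt upstream of a given cleavage site. The function name, the 0-based coordinate convention, and the choice to measure distance from the start of the hexamer are assumptions made for this example; they are not taken from any of the cited tools.

```python
import re

# Canonical PAS hexamer and the variant named in the text (DNA alphabet).
PAS_HEXAMERS = ("AATAAA", "ATTAAA")


def find_pas_hexamers(seq, cleavage_pos, min_dist=15, max_dist=40):
    """Return (hexamer, distance) pairs found upstream of a cleavage site.

    seq          : transcript-strand sequence (DNA or RNA letters)
    cleavage_pos : 0-based index of the first nucleotide after the cleavage point
    distance     : cleavage_pos minus the start index of the hexamer
                   (an assumed convention for this sketch)
    """
    seq = seq.upper().replace("U", "T")
    hits = []
    for hexamer in PAS_HEXAMERS:
        # A zero-width lookahead reports overlapping occurrences as well.
        for match in re.finditer(f"(?={hexamer})", seq):
            distance = cleavage_pos - match.start()
            if min_dist <= distance <= max_dist:
                hits.append((hexamer, distance))
    # Closest hexamer first.
    return sorted(hits, key=lambda hit: hit[1])


# Toy sequence: 30 nt of filler, the hexamer, then 20 nt before the cleavage point.
toy = "G" * 30 + "AAUAAA" + "C" * 20 + "G" * 10
print(find_pas_hexamers(toy, cleavage_pos=56))
```

A real analysis would also need strand handling and a longer list of hexamer variants; this sketch only covers the two motifs named in the text.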

2.3 Splicing and alternative splicing

Splicing regulation is even more sophisticated. The spliceosome, the largest protein complex in human cells, which also contains small nuclear RNAs (snRNAs), recognizes donor and acceptor splice sites (SSs) in RNA to define exons [17]. The core components of the spliceosome are small nuclear ribonucleoproteins (snRNPs), each consisting of an snRNA and Sm or Sm-like (LSm) proteins. The U1, U2, U4, U5, and U6 snRNPs regulate >99% of splicing events, while a variant, the so-called minor spliceosome formed by the U11, U12, U4atac, U6atac, and U5 snRNPs, regulates the rest [18, 19]. The difference in snRNP composition between the two spliceosomes corresponds to two types of introns. The major, or U2-type, introns start with almost invariable GU dinucleotides at the 5′ splice site (SS) and end with AG dinucleotides at the 3′ SS, whereas U12-type introns start with GU or AU and end with AC or AG [20]. Classically, the 9-nt sequence (−3 to +6) around the 5′ SS and the 23-nt sequence (−20 to +3) around the 3′ SS are used for calculating splice site strength [21].
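To make the two scoring windows concrete, here is a minimal sketch that extracts the 9-nt window (−3 to +6) around a 5′ SS and the 23-nt window (−20 to +3) around a 3′ SS from a pre-mRNA sequence. The function names and the 0-based coordinate convention are assumptions for illustration, and the sketch stops at window extraction: it does not compute a strength score.

```python
def donor_window(seq, intron_start):
    """9-nt window around a 5' splice site: last 3 exonic + first 6 intronic nt.

    intron_start is the 0-based index of the first intronic nucleotide.
    """
    if intron_start < 3 or intron_start + 6 > len(seq):
        raise ValueError("donor window runs off the sequence")
    return seq[intron_start - 3:intron_start + 6]


def acceptor_window(seq, exon_start):
    """23-nt window around a 3' splice site: last 20 intronic + first 3 exonic nt.

    exon_start is the 0-based index of the first nucleotide of the downstream exon.
    """
    if exon_start < 20 or exon_start + 3 > len(seq):
        raise ValueError("acceptor window runs off the sequence")
    return seq[exon_start - 20:exon_start + 3]


# Toy pre-mRNA: exon 1, a U2-type intron (GT...AG in DNA letters), exon 2.
exon1 = "ACGCAG"
intron = "GTAAGT" + "C" * 10 + "T" * 18 + "AG"
exon2 = "GTACCA"
pre_mrna = exon1 + intron + exon2

print(donor_window(pre_mrna, len(exon1)))                   # CAGGTAAGT
print(acceptor_window(pre_mrna, len(exon1) + len(intron)))  # TTTTTTTTTTTTTTTTTTAGGTA
```

The extracted windows are the kind of fixed-length input that a splice site scoring model would then evaluate.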

During RNA maturation, introns are removed, and the two flanking exons are joined together. However, multiple SSs from the same gene can compete with each other, resulting in variable exon definitions and leading to complicated forms of alternative splicing, such as skipping single or multiple exons (SE), retaining introns (RI), and utilizing non-canonical 5′ SS or 3′ SS (A5SS and A3SS) (Figure 2C). Even more surprisingly, studies revealed that downstream exons can be back-spliced to join upstream exons, which generates RNA circles or so-called circular RNA [22, 23, 24]. In addition to the core spliceosome, other trans-factors also contribute to splicing via enhancing or repressing the definition of exons through the interaction with the core spliceosome, further increasing the complexity of splicing choice. Overall, the largest proportion of human isoform variations are from alternative splicing.

2.4 Crosstalk among the three types of variations

Despite the distinct molecular mechanisms underlying alternative TSS, polyadenylation, and splicing, the regulation of these variations is not fully independent. Intuitively, utilizing an alternative TSS is associated with alternative splicing of the first exon, while alternative splicing of the last exon impacts the choice of polyadenylation site. In addition, the PAS signal is present in almost every intron in the human genome, suggesting a balance between removing an intron by splicing and terminating transcription by polyadenylation. Indeed, studies reveal that U1 snRNP protects pre-mRNA from drastic premature termination at PASs in introns [25], and that U1 motifs and PAS signals together shape the landscape of promoter directions [26]. Interestingly, a recent study reveals that the choice of 3′ ends for polyadenylation is globally influenced by the selection of TSS [27], suggesting coupling among the regulation of the three types of variations.

3. Functional consequence of RNA isoform variations

In general, isoform variations have three kinds of consequences: (a) altering the open reading frame (ORF), including truncation, (b) changing noncoding sequences of RNA such as untranslated regions (UTRs), and (c) triggering mRNA degradation by introducing a premature termination codon (PTC) or frameshift. Alternative TSS can impact the 5′ untranslated region (5′ UTR) or alter the N-terminus or even the complete ORF (Figure 3A). Similarly, APA can shorten or lengthen 3′ UTRs and change the C-terminus of the ORF (Figure 3B). Traditionally, splicing has been investigated predominantly in the coding sequence (CDS), where it leads either to ORF alterations or to messenger RNA (mRNA) degradation by introducing a PTC. However, our recent study reveals pervasive splicing within 3′ UTRs [28], indicating that the influence of AS covers all three kinds of consequences (Figure 3C).

Figure 3.

Three kinds of consequences induced by alternative TSS (A), polyadenylation (B), and splicing (C) at the molecular level. SE: skipped exon; A5SS and A3SS: alternative 5′ and 3′ splice site; RI: retained intron. The narrow green and red bars indicate the start and end codon, respectively.

While ORF alterations can drastically impact the functions of the encoded protein, noncoding sequence changes mainly regulate the final protein production. For instance, varied 5′ UTRs affect mRNA translation efficiency [29], while 3′ UTR variations regulate mRNA stability and translation efficiency [30]. mRNA degradation triggered by a PTC usually results in loss of function, and its eventual effect depends on the proportion of the affected isoform. For instance, widespread intron retention has a moderate effect on gene regulation by tuning mammalian tissue RNA expression [31]. Besides, conserved PTC-containing cassette exons in the brain are particularly enriched in RBPs, such as splicing factors and other regulators of RNA metabolic processes, which enables potential gene autoregulation [32].

3.1 Isoform variations regulate biological processes at the cellular level

Since isoform variations or switches have broad impacts on gene expression, they induce diverse functional consequences at the cellular level. An interesting example is about cell stemness and differentiation, where splicing has been the most extensively investigated. For instance, a conserved AS event controls the inclusion or exclusion of a cassette exon in FOXP1 that encodes the protein domain for DNA binding preference, further regulating embryonic stem cell pluripotency and reprogramming [33]. Besides, intron retention changes in a group of genes, including Lmnb1, regulate granulocyte differentiation via degradation of intron-retained RNA isoforms by nonsense mediated decay (NMD) [34].

In addition to AS, APA also contributes significantly to cell fate commitment, as evidenced by the perturbation of NUDT21, an APA regulator controlling 3′ UTR lengthening or shortening for thousands of transcripts [35]. Beyond differentiation, AS and APA are both capable of regulating cell proliferation and cell cycle progression. An interesting phenomenon is that actively proliferating cells tend to express isoforms with shortened 3′ UTRs via APA [36]. In general, both AS and APA exhibit substantial tissue specificity. Therefore, in addition to transcription-determined gene expression, isoform variation serves as another layer of regulation, playing critical roles in cellular functions that shape phenotypic differences across tissues and organs.

4. Dysregulated RNA isoform variations in disease

As RNA isoform variation acts as a critical layer of gene regulation that is tightly controlled across different cell types and under different conditions, its dysregulation can disrupt gene functions and lead to diseases. In this section, we will mainly introduce isoform dysregulation in human genetic disorders and cancers. Genetic disorders caused by aberrant splicing can be classified into two classes: those affecting cis-elements responsible for appropriate splicing, and those disrupting trans-factors regulating splicing, such as components of the spliceosome. The former includes mutations directly disrupting splice sites, accounting for an estimated 15% of human genetic disorders [37], as well as mutations that affect other splicing-related cis-elements. In total, it has been proposed that 50% or even 60% of human diseases are caused by these cis-element mutations [7]. For mutations impairing trans-factors regulating splicing, an extensively studied example is spinal muscular atrophy (SMA). SMA, which affects around 1/4000 to 1/16,000 births worldwide, is caused by mutations in the SMN1 gene [38]. The SMN1 gene encodes SMN, a protein responsible for snRNP assembly, and thus its mutations are expected to impair splicing globally. Compared to splicing, diseases caused by dysregulated polyadenylation have been reported far less frequently. Still, studies have reported several mutations that disrupt the PAS signal and cause hematological disorders [39].

Nevertheless, dysregulation of all three types of isoform variations has been observed in cancer. A study systematically quantifying promoter activities by analyzing 18,468 RNA-seq samples across 42 cancer types reveals widespread utilization of alternative promoters [8]. In line with the observation that actively proliferating cells preferentially express isoforms with shortened 3′ UTRs [36], 3′ UTRs are globally shortened in cancer cells via APA compared to normal cells [40, 41]. Moreover, intronic polyadenylation (IPA), another type of APA that occurs within introns, is widespread in leukemia, where it truncates the encoded proteins to inactivate tumor suppressors [42]. While the mechanism underlying widespread IPA is still under exploration, one of the main contributors to global 3′ UTR shortening in cancer cells appears to be NUDT21 (CFIm25). To date, NUDT21 has been reported to regulate APA and contribute to cancer progression in glioblastoma [16], hepatocellular carcinoma [43], lung cancer [44], bladder cancer [45], and cervical cancer [46].

Compared with alternative TSS and APA, splicing has been much more extensively studied in cancer at both the DNA and RNA levels. Mutations disrupting canonical splice sites have been identified in various essential cancer-related genes, such as TP53 and BRCA1 [47], and this is a widespread mechanism for inactivating tumor suppressors [48]. Besides, recurrent somatic mutations in splicing factors have also been identified, with representative studies showing frequent mutations in the genes SF3B1, SRSF2, U2AF1, and ZRSR2 in hematopoietic malignancies, such as myelodysplastic syndromes (MDS) and leukemia [49, 50, 51]. A more recent study characterizing splicing factor mutations across 33 cancer types demonstrates that hotspot mutations of SF3B1 and U2AF1 are also frequent in multiple solid tumors [52]. Beyond mutations in splicing factors, MYC, a well-known proto-oncogene encoding a transcription factor, indirectly regulates splicing by promoting the expression of core spliceosome components as an essential step in lymphomagenesis [53]. These studies suggest that splicing dysregulation is potentially highly frequent in cancer. Indeed, a comprehensive analysis of AS based on RNA-seq samples from 8705 cancer patients reveals more active AS in tumors compared to normal samples, as well as hundreds of aberrant splicing events with novel exon-exon junctions that are not present in normal samples [54].

Taken together, these studies demonstrate the high frequency of dysregulated isoform variations in diseases, with aberrant splicing being the most evident. Many of these dysregulations disrupt gene functions by isoform switching and contribute significantly to pathogenesis. On the other hand, they may also serve as therapeutic targets. For instance, an antisense oligonucleotide drug modulating AS of the SMN2 gene has been approved by the FDA for SMA treatment. Additionally, dysregulated isoform variations in cancer that induce novel ORFs, such as aberrant splicing and IPA, may serve as sources of neoantigens [55, 56], which are promising for cancer vaccine development.

5. Technology advancement in quantifying and predicting isoform variations

NGS is a technology capable of sequencing massive numbers of DNA fragments in parallel, up to hundreds of millions in one experiment. Applying NGS to complementary DNA (cDNA) libraries reverse transcribed from RNA, known as RNA-seq, has revolutionized gene expression profiling. In addition, RNA-seq is also efficient in detecting and quantifying alternative splicing events using reads (short fragments sequenced by NGS) that support splice junctions. An extensively used metric is the percent spliced in (PSI), calculated as the number of reads supporting inclusion divided by the sum of reads supporting inclusion and exclusion, which measures the proportion of isoforms that include the event. With this metric, global AS profiling has been performed across different tissues and species [57, 58, 59], as well as organs at different developmental stages [60].
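The PSI definition above translates directly into a short function. The sketch below is a minimal version that works on raw junction read counts and returns a fraction between 0 and 1; the function name and the choice to return `None` when no reads cover the event are assumptions made for this example, and no length normalization is applied.

```python
def percent_spliced_in(inclusion_reads, exclusion_reads):
    """PSI as a fraction in [0, 1], or None if the event has no supporting reads.

    inclusion_reads : reads supporting the "spliced in" form (e.g. exon included)
    exclusion_reads : reads supporting the "spliced out" form (e.g. exon skipped)
    """
    if inclusion_reads < 0 or exclusion_reads < 0:
        raise ValueError("read counts must be non-negative")
    total = inclusion_reads + exclusion_reads
    if total == 0:
        return None  # PSI is undefined without any junction-supporting reads
    return inclusion_reads / total


# 30 reads support inclusion and 10 support skipping of a cassette exon.
print(percent_spliced_in(30, 10))  # 0.75
```

In practice, events with very few supporting reads give unstable PSI estimates, so a minimum read threshold is usually applied before comparing samples.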

As detecting and quantifying AS events from RNA-seq data is relatively straightforward, the computational methods for AS mainly focus on differential splicing analysis. rMATS and DEXSeq are two representative methods that have been widely utilized [61, 62]. However, accurate identification and quantification of alternative TSS and APA directly from RNA-seq data is more challenging than for AS. To this end, various experimental methods have been developed based on TSS and PAS properties. We will briefly list these experimental methods, together with computational efforts toward accurate identification and quantification of TSS and PAS expression, in Section 5.1. Furthermore, because distinct cis-elements determine the definition and utilization of TSSs, PASs, and splice sites, efforts have been made to predict them in silico. These involve both traditional machine learning technologies and more recent advancements in deep-learning-based artificial intelligence (AI), which will be highlighted in Section 5.2.

5.1 Identifying and quantifying alternative TSS and PAS from NGS data

There are two types of high-throughput experimental approaches for comprehensive TSS profiling. As mature transcripts transcribed by RNA polymerase II carry a specific cap structure at their 5′ ends, cap analysis of gene expression (CAGE) can enrich the 5′ ends of RNA. CAGE followed by massively parallel sequencing (CAGE-seq) is a high-throughput approach for transcriptome-wide TSS profiling [63]. Additionally, active promoter regions in chromatin are enriched with several specific histone modification markers, including tri-methylation of histone 3 lysine 4 (H3K4me3) and acetylation of histone 3 lysine 9 (H3K9ac) and lysine 27 (H3K27ac). The second type of method infers TSS utilization from chromatin immunoprecipitation followed by sequencing (ChIP-seq) for these three markers, albeit at relatively lower resolution [64]. With CAGE-seq, the Functional ANnoTation of the Mammalian Genome (FANTOM) project identified 201,802 putative TSSs across dozens of cell lines [65]. Among them, 70% (143,200) are from genic regions, while only 40% of these (56,793 out of 143,200) are associated with annotated transcripts based on GENCODE (Figure 4A and B), suggesting that the current transcript isoform annotation might still be incomplete. On the other hand, computational methods have also been developed to annotate and quantify TSSs from RNA-seq data by utilizing splice junctions across the first and second exons [8], or RNA-seq coverage together with sequence features [66].

Figure 4.

Statistics of putative TSSs and PASs identified by sequencing approaches from three databases. (A) The schematic shows the definition of four genomic regions based on the GENCODE annotation. (B) The pie chart shows the proportion of TSSs across the four regions, and the barplot illustrates the number of TSSs associated with annotated transcript isoforms (distance <50 bp to the transcript start sites). (C) The proportion of PASs across the four regions. (D) Venn diagram showing the overlapping PASs across the two databases and the GENCODE annotation. Note that the numbers in (C) and (D) are not consistent because multiple PASs might be merged together when calculating the overlap (distance <50 bp).
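The association criterion used in Figure 4 (distance <50 bp to an annotated position) can be sketched as a nearest-neighbor lookup on sorted coordinates. The following minimal example assumes that all positions lie on the same chromosome and strand; the function name and input layout are hypothetical and do not reproduce the exact procedure behind the figure, which also merges nearby sites.

```python
from bisect import bisect_left


def fraction_near_annotation(sites, annotated, max_dist=50):
    """Fraction of sites lying strictly within max_dist bp of an annotated position.

    sites, annotated : iterables of integer coordinates on one chromosome and strand
    """
    sites = list(sites)
    annotated = sorted(annotated)
    if not sites:
        return 0.0
    matched = 0
    for pos in sites:
        i = bisect_left(annotated, pos)
        # Only the nearest annotated positions on either side can be within range.
        neighbours = annotated[max(i - 1, 0):i + 1]
        if any(abs(pos - a) < max_dist for a in neighbours):
            matched += 1
    return matched / len(sites)


# Two of the three toy sites fall within 50 bp of an annotated transcript end.
print(fraction_near_annotation([100, 500, 1000], [120, 951, 2000]))
```

Sorting once and using binary search keeps the lookup efficient even for hundreds of thousands of sites.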

For global PAS identification and APA quantification, dozens of experimental protocols have been developed. The majority of them enrich fragments of the transcript 3′ end close to or comprising the poly-A tail, such as 3P-seq, Aseq, PolyA-seq, and 3′ READS [67, 68, 69, 70]. Accordingly, two widely used databases, PolyASite and PolyA_DB, have been constructed using data from these experimental 3′ end sequencing (3′ end-seq) approaches [71, 72]. In total, 53% of PolyASite (v2) PASs and 64% of PolyA_DB (v3) PASs are from genic regions (Figure 4C). Even though both databases are derived from 3′ end-seq data, the overlapping sites account for only 39% of the sites from PolyASite and 66% of the sites from PolyA_DB, suggesting heterogeneity among different 3′ end-seq protocols; for instance, PolyA_DB uses data from 3′ READS, while PolyASite uses data from 3P-seq, Aseq, PolyA-seq, and other protocols. Moreover, 86% of PolyASite PASs and 79% of PolyA_DB PASs are not associated with any annotated transcript in GENCODE, again indicating that the isoform annotation is far from complete (Figure 4D). In addition to experimental approaches, computational methods for directly identifying PASs and quantifying APA from RNA-seq data are also feasible, although their efficiency and accuracy are suboptimal. Almost all of them detect drop points in RNA-seq coverage along the gene body, and many are only able to identify and quantify APA within 3′ UTRs, such as TAPAS [73], QAPA [74], GETUTR [75], APAtrap [76], DaPars2 [77], and Aptardi [78]. Besides, IPAFinder is designed specifically for detecting APA within intronic regions from RNA-seq data [61]. More recently, we developed APAIQ, an accurate method capable of transcriptome-wide APA identification and quantification from RNA-seq data, showing much higher precision and recall than previous methods [14].
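The idea of detecting a drop point in RNA-seq coverage can be illustrated with a simple two-segment step fit: for every candidate position, the coverage upstream and downstream is summarized by its mean, and the position with the smallest squared error is reported if coverage decreases across it. This is a generic change-point sketch under simplifying assumptions (a single drop, per-base coverage as a plain list), not the algorithm of any of the tools listed above; the function name and the `min_segment` parameter are hypothetical.

```python
def find_coverage_drop(coverage, min_segment=10):
    """Locate the most likely single drop point in a per-base coverage profile.

    Returns (index, upstream_mean, downstream_mean), where index is the first
    position of the lower-coverage segment, or None if no drop is found.
    """
    n = len(coverage)
    if n < 2 * min_segment:
        return None

    # Prefix sums of values and squared values give O(1) segment statistics.
    prefix, prefix_sq = [0.0], [0.0]
    for value in coverage:
        prefix.append(prefix[-1] + value)
        prefix_sq.append(prefix_sq[-1] + value * value)

    best = None
    for k in range(min_segment, n - min_segment + 1):
        left_sum, right_sum = prefix[k], prefix[n] - prefix[k]
        left_sq, right_sq = prefix_sq[k], prefix_sq[n] - prefix_sq[k]
        # Sum of squared errors when each segment is replaced by its mean.
        sse = (left_sq - left_sum ** 2 / k) + (right_sq - right_sum ** 2 / (n - k))
        left_mean, right_mean = left_sum / k, right_sum / (n - k)
        if left_mean > right_mean and (best is None or sse < best[0]):
            best = (sse, k, left_mean, right_mean)

    return None if best is None else best[1:]


# Toy 3' UTR: coverage of 20 over the first 50 nt, dropping to 5 afterwards.
print(find_coverage_drop([20] * 50 + [5] * 50))  # (50, 20.0, 5.0)
```

Real coverage profiles are noisy and may contain several sites, which is one reason why coverage-based PAS calling is harder than this toy case suggests.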

5.2 Predicting isoform variations with DNA/RNA sequences

Owing to the heterogeneous flexibility of sequence features for TSSs, PASs, and splice sites, as well as their distinct pathogenic impacts, the prediction of the three types of isoform variations from DNA/RNA sequences is at different stages of development and has different focuses. To date, there are many computational methods for predicting TSSs, but none of them is capable of predicting the utilization of alternative TSSs from the same gene. Hence, in this section, we will focus only on alternative polyadenylation and splicing. We will briefly introduce the historic computational methods and highlight the recent advancements utilizing artificial intelligence (AI) techniques in predicting these two kinds of variations.

5.2.1 Predicting PAS and APA

The early methods for PAS prediction mainly aimed to discriminate true PASs from pseudo-ones that also comprise the PAS motif, utilizing hand-crafted features from statistical frequency analysis [62, 79, 80, 81]. Thereafter, utilizing a set of latent sequence features extracted by Hidden Markov Models (HMMs) trained with a benchmark dataset significantly improved the accuracy [82]. These methods rely on a set of features based on prior knowledge or extracted by machine learning, which cannot capture the sequence information in full. With the advancement of deep neural networks, sequences around PASs have been used directly as the input for deep learning models without further feature selection, achieving good performance in binary classification of true/false PASs in multiple benchmark datasets [83, 84]. Still, these models are not able to quantitatively predict the strength of each PAS, let alone the alternative usage across multiple PASs. To this end, we have developed DeeReCT-APA, a deep-learning architecture for predicting the usage of alternative PASs of a given gene [85]. However, DeeReCT-APA cannot evaluate the strength of each PAS along the genome or predict mutation outcomes in APA. In 2023, using the large-scale PAS dataset from PolyA_DB, PolyAID achieved nucleotide-resolution prediction of PASs along the genome. By optimizing the local gene structure of each PAS, PolyAID can predict its strength and usage. More importantly, applying PolyAID to scan the genome reveals thousands of genetic variants potentially impacting polyadenylation activity [86].
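Feeding sequences directly into a deep learning model, as described above, typically starts from a one-hot encoding of the nucleotides. The sketch below shows this common preprocessing step in plain Python; the function name, the A/C/G/T column order, and the all-zero encoding of ambiguous bases are assumptions for illustration and are not taken from the cited models.

```python
ALPHABET = "ACGT"  # assumed column order for this sketch


def one_hot_encode(seq):
    """Encode a nucleotide string as a list of 4-element rows (A, C, G, T).

    RNA input is accepted by mapping U to T; any other symbol (e.g. N)
    becomes an all-zero row.
    """
    seq = seq.upper().replace("U", "T")
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in seq]


# Each row has a single 1 marking the base at that position.
for base, row in zip("AAUAAA", one_hot_encode("AAUAAA")):
    print(base, row)
```

A model would receive such a matrix for a fixed-length window centered on each candidate PAS, leaving feature extraction to the network itself.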

5.2.2 Predicting alternative splicing

Traditional computational methods for predicting splice events typically rely on motifs [87, 88, 89]. These methods assume the existence of characteristic sequences, or motifs, near splice acceptor and donor sites. Any mutations disrupting these motifs can consequently impact splicing events. While motif-based approaches offer intuitiveness and interpretability, they suffer from limitations in coverage and fail to fully capture the intricate regulatory mechanisms governing splicing. To overcome these limitations, some studies have integrated machine learning techniques, such as support vector machines and random forests, with splice event prediction [90, 91, 92]. While these traditional machine learning algorithms perform better than motif-based methods, they heavily rely on feature engineering, limiting their applicability and generalization.

SpliceAI pioneered the prediction of alternative splicing events through end-to-end deep learning models, utilizing gene sequences directly as inputs to estimate the probability of each position being an acceptor or donor site [93]. In comparison to predecessors such as MMSplice [94] and HAL [95], SpliceAI substantially extends the input sequence length to over 10,000 bases, which allows a broader range of information surrounding splice sites to be considered, especially nearby regulatory elements. To effectively process such extensive sequences, a dilated convolutional neural network with residual connections is employed. SpliceAI has exhibited remarkable performance in both splice site identification and the prediction of mutation impacts on alternative splicing events. However, SpliceAI lacks the capacity to differentiate between variations across different tissues and organisms, hindering its generalizability.
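To see how stacked dilated convolutions can cover a context of about 10,000 bases, it helps to compute the receptive field: each stride-1 convolution with kernel size k and dilation d widens it by (k − 1) × d positions. The sketch below applies this rule to a hypothetical stack; the kernel sizes, dilation rates, and group layout are illustrative assumptions chosen so that the total context comes to 10,000 nt, and should not be read as a statement of the published architecture.

```python
def receptive_field(layers):
    """Receptive field (in positions) of stacked stride-1 convolutions.

    layers : iterable of (kernel_size, dilation) pairs, one per convolution
    """
    field = 1
    for kernel_size, dilation in layers:
        field += (kernel_size - 1) * dilation
    return field


# Hypothetical stack: four groups of four residual units, two convolutions per unit.
groups = [(11, 1), (11, 4), (21, 10), (41, 25)]  # (kernel_size, dilation) per group
layers = [config for config in groups for _ in range(4 * 2)]

print(receptive_field(layers))  # 10001, i.e. 5,000 nt of context on each side
```

Increasing the dilation in deeper groups lets the receptive field grow much faster than the number of parameters, which is why dilated convolutions suit long genomic inputs.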

Several subsequent studies have sought to improve on SpliceAI, particularly in predicting the effects of mutations on alternative splicing. These efforts integrate data from multiple species and tissues [96], curated alternative splice sites [97], the scaling law [98, 99], and predictions from other tools [100]. Another deep-learning approach treats gene sequences as analogous to natural language text and trains large language models on existing sequencing data to forecast splicing and mutation effects, as demonstrated by Enformer [101], DNABERT [102], and HyenaDNA [103]. These methods rely on transformer-based architectures and large-scale sequencing data, which allow the model's receptive field to be extended substantially. For instance, Enformer achieves a receptive field of 100 kb and has demonstrated state-of-the-art performance on multiple gene expression prediction tasks. It is important to note that these methods are designed for general sequence prediction tasks, and they may therefore not always perform optimally in predicting splice events.
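To make the analogy with natural language concrete, a DNA sequence has to be split into tokens before a language model can process it. One common choice, sketched below, is overlapping k-mers mapped to integer IDs; other models operate on single nucleotides instead. The function name, the value k = 6, and the toy sequence are illustrative assumptions, and the vocabulary is built from the input only rather than from a fixed, pre-trained token table.

```python
def kmer_tokenize(seq, k=6):
    """Split a DNA sequence into overlapping k-mers with a stride of 1."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# Hypothetical toy sequence; real inputs are hundreds to thousands of bases long.
tokens = kmer_tokenize("ACGTACGTAC", k=6)

# Toy vocabulary built from the observed tokens (assumption: no special tokens).
vocab = {kmer: idx for idx, kmer in enumerate(sorted(set(tokens)))}
token_ids = [vocab[t] for t in tokens]
print(tokens)
print(token_ids)
```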

Despite great progress in methods for identifying and quantifying alternative TSS, splicing, and polyadenylation from RNA-seq data, as well as in deep-learning-based methods for predicting isoform variations from DNA/RNA sequence, all of these methods are designed to detect variations at the event level rather than variations across intact RNA isoforms. TGS platforms from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), which enable high-throughput sequencing of DNA or RNA with reads of up to 25 kb and 300 kb, respectively, have emerged as powerful tools for full-length RNA isoform profiling. In 2022, a study applied ONT sequencing to 88 samples from genotype-tissue expression (GTEx) tissues and cell lines, revealing significant coupling between multiple alternative splicing events across isoforms and identifying allele-specific isoform usage [104]. This highlights the importance of full-length isoforms in assessing the outcome of genetic variants. Moreover, ~60% of TSSs captured by CAGE-seq and over 80% of PAS identified by 3′ end-seq data are not associated with any transcript in GENCODE, suggesting that the current isoform annotation is far from complete and that isoform identification with TGS data remains an unmet need. With the accumulation of full-length isoform profiling data and the advancement of AI techniques, we anticipate isoform-centric deep-learning models that encompass all types of variation events and predict the outcome at the level of native, intact RNA molecules.
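The difference between event-level and isoform-level analysis can be sketched with toy data. In the example below, each long read is summarized as the tuple of exons it contains, and the joint inclusion of two cassette exons is compared with what would be expected if the two events were independent. The read counts and exon names are invented for illustration, and the simple difference statistic is an assumption rather than the test used in [104].

```python
from collections import Counter

# Hypothetical long reads, each summarized as the exons it spans.
toy_reads = (
    [("E1", "E2", "E3", "E4", "E5")] * 40   # both cassette exons included
    + [("E1", "E3", "E5")] * 40             # both cassette exons skipped
    + [("E1", "E2", "E3", "E5")] * 10       # only E2 included
    + [("E1", "E3", "E4", "E5")] * 10       # only E4 included
)


def coupling(reads, exon_a, exon_b):
    """Return inclusion rates of two exons, their joint rate, and the excess
    of the joint rate over the product expected under independence."""
    table = Counter((exon_a in read, exon_b in read) for read in reads)
    n = len(reads)
    p_a = (table[(True, True)] + table[(True, False)]) / n
    p_b = (table[(True, True)] + table[(False, True)]) / n
    p_ab = table[(True, True)] / n
    return p_a, p_b, p_ab, p_ab - p_a * p_b


p_a, p_b, p_ab, excess = coupling(toy_reads, "E2", "E4")
print(p_a, p_b, p_ab, round(excess, 2))
```

Short-read, event-level quantification would report only the two marginal inclusion rates, which are identical here whether or not the events are coupled. The joint rate is observable only because each read covers both exons.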

References

  1. Lander ES et al. Initial sequencing and analysis of the human genome. Nature. 2001;409(6822):860-921
  2. Venter JC et al. The sequence of the human genome. Science. 2001;291(5507):1304-1351
  3. Nurk S et al. The complete sequence of a human genome. Science. 2022;376(6588):44-53
  4. Harrow J et al. GENCODE: The reference human genome annotation for the ENCODE project. Genome Research. 2012;22(9):1760-1774
  5. Tilgner H et al. Deep sequencing of subcellular RNA fractions shows splicing to be predominantly co-transcriptional in the human genome but inefficient for lncRNAs. Genome Research. 2012;22(9):1616-1625
  6. Proudfoot NJ. Transcriptional termination in mammals: Stopping the RNA polymerase II juggernaut. Science. 2016;352(6291):aad9926
  7. Wang G-S, Cooper TA. Splicing in disease: Disruption of the splicing code and the decoding machinery. Nature Reviews Genetics. 2007;8(10):749-761
  8. Demircioğlu D et al. A pan-cancer transcriptome analysis reveals pervasive regulation through alternative promoters. Cell. 2019;178(6):1465-1477 e17
  9. Wang Z, Gerstein M, Snyder M. RNA-Seq: A revolutionary tool for transcriptomics. Nature Reviews Genetics. 2009;10(1):57-63
  10. Haberle V, Stark A. Eukaryotic core promoters and the functional basis of transcription initiation. Nature Reviews Molecular Cell Biology. 2018;19(10):621-637
  11. Lifton R et al. The organization of the histone genes in Drosophila melanogaster: Functional and evolutionary implications. In: Cold Spring Harbor Symposia on Quantitative Biology. Cold Spring Harbor Laboratory Press; 1978. DOI: 10.1101/SQB.1978.042.01.105
  12. Cramer P. Organization and regulation of gene transcription. Nature. 2019;573(7772):45-54
  13. Proudfoot N, Brownlee G. 3′ non-coding region sequences in eukaryotic messenger RNA. Nature. 1976;263(5574):211-214
  14. Long Y et al. Accurate transcriptome-wide identification and quantification of alternative polyadenylation from RNA-seq data with APAIQ. Genome Research. 2023;33(4):644-657
  15. Xiao MS et al. Global analysis of regulatory divergence in the evolution of mouse alternative polyadenylation. Molecular Systems Biology. 2016;12(12):890
  16. Masamha CP et al. CFIm25 links alternative polyadenylation to glioblastoma tumour suppression. Nature. 2014;510(7505):412-416
  17. Matera AG, Wang Z. A day in the life of the spliceosome. Nature Reviews Molecular Cell Biology. 2014;15(2):108-121
  18. Turunen JJ et al. The significant other: Splicing by the minor spliceosome. Wiley Interdisciplinary Reviews: RNA. 2013;4(1):61-76
  19. Will CL, Lührmann R. Spliceosome structure and function. Cold Spring Harbor Perspectives in Biology. 2011;3(7):a003707
  20. Sheth N et al. Comprehensive splice-site analysis using comparative genomics. Nucleic Acids Research. 2006;34(14):3955-3967
  21. Yeo G, Burge CB. Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. RECOMB 03: Proceedings of the Seventh Annual International Conference on Research in Computational Molecular Biology. 2003. DOI: 10.1145/640075.640118
  22. Memczak S et al. Circular RNAs are a large class of animal RNAs with regulatory potency. Nature. 2013;495(7441):333-338
  23. Salzman J et al. Circular RNAs are the predominant transcript isoform from hundreds of human genes in diverse cell types. PLoS One. 2012;7(2):e30733
  24. Hansen TB et al. Natural RNA circles function as efficient microRNA sponges. Nature. 2013;495(7441):384-388
  25. Berg MG et al. U1 snRNP determines mRNA length and regulates isoform expression. Cell. 2012;150(1):53-64
  26. Almada AE et al. Promoter directionality is controlled by U1 snRNP and polyadenylation signals. Nature. 2013;499(7458):360-363
  27. Alfonso-Gonzalez C et al. Sites of transcription initiation drive mRNA isoform selection. Cell. 2023;186(11):2438-2455 e22
  28. Chan JJ et al. Pan-cancer pervasive upregulation of 3′ UTR splicing drives tumourigenesis. Nature Cell Biology. 2022;24(6):928-939
  29. Leppek K, Das R, Barna M. Functional 5′ UTR mRNA structures in eukaryotic translation regulation and how to find them. Nature Reviews Molecular Cell Biology. 2018;19(3):158-174
  30. Mayr C. What are 3′ UTRs doing? Cold Spring Harbor Perspectives in Biology. 2019;11(10):a034728
  31. Braunschweig U et al. Widespread intron retention in mammals functionally tunes transcriptomes. Genome Research. 2014;24(11):1774-1786
  32. Yan Q et al. Systematic discovery of regulated and conserved alternative exons in the mammalian brain reveals NMD modulating chromatin regulators. Proceedings of the National Academy of Sciences. 2015;112(11):3445-3450
  33. Gabut M et al. An alternative splicing switch regulates embryonic stem cell pluripotency and reprogramming. Cell. 2011;147(1):132-146
  34. Wong JJ-L et al. Orchestrated intron retention regulates normal granulocyte differentiation. Cell. 2013;154(3):583-595
  35. Brumbaugh J et al. Nudt21 controls cell fate by connecting alternative polyadenylation to chromatin signaling. Cell. 2018;172(1):106-120 e21
  36. Sandberg R et al. Proliferating cells express mRNAs with shortened 3′ untranslated regions and fewer microRNA target sites. Science. 2008;320(5883):1643-1647
  37. Krawczak M, Reiss J, Cooper DN. The mutational spectrum of single base-pair substitutions in mRNA splice junctions of human genes: Causes and consequences. Human Genetics. 1992;90:41-54
  38. Verhaart IE et al. A multi-source approach to determine SMA incidence and research ready population. Journal of Neurology. 2017;264:1465-1473
  39. Gruber AJ, Zavolan M. Alternative cleavage and polyadenylation in health and disease. Nature Reviews Genetics. 2019;20(10):599-614
  40. Xia Z et al. Dynamic analyses of alternative polyadenylation from RNA-seq reveal a 3′-UTR landscape across seven tumour types. Nature Communications. 2014;5(1):5274
  41. Mayr C, Bartel DP. Widespread shortening of 3′ UTRs by alternative cleavage and polyadenylation activates oncogenes in cancer cells. Cell. 2009;138(4):673-684
  42. Lee S-H et al. Widespread intronic polyadenylation inactivates tumour suppressor genes in leukaemia. Nature. 2018;561(7721):127-131
  43. Tan S et al. NUDT21 negatively regulates PSMB2 and CXXC5 by alternative polyadenylation and contributes to hepatocellular carcinoma suppression. Oncogene. 2018;37(35):4887-4900
  44. Huang J et al. Suppression of cleavage factor Im 25 promotes the proliferation of lung cancer cells through alternative polyadenylation. Biochemical and Biophysical Research Communications. 2018;503(2):856-862
  45. Xiong M et al. NUDT21 inhibits bladder cancer progression through ANXA2 and LIMK2 by alternative polyadenylation. Theranostics. 2019;9(24):7156
  46. Xing Y et al. Downregulation of NUDT21 contributes to cervical cancer progression through alternative polyadenylation. Oncogene. 2021;40(11):2051-2064
  47. Jayasinghe RG et al. Systematic analysis of splice-site-creating mutations in cancer. Cell Reports. 2018;23(1):270-281 e3
  48. Jung H et al. Intron retention is a widespread mechanism of tumor-suppressor inactivation. Nature Genetics. 2015;47(11):1242-1248
  49. Quesada V et al. Exome sequencing identifies recurrent mutations of the splicing factor SF3B1 gene in chronic lymphocytic leukemia. Nature Genetics. 2012;44(1):47-52
  50. Yoshida K et al. Frequent pathway mutations of splicing machinery in myelodysplasia. Nature. 2011;478(7367):64-69
  51. Graubert TA et al. Recurrent mutations in the U2AF1 splicing factor in myelodysplastic syndromes. Nature Genetics. 2012;44(1):53-57
  52. Seiler M et al. Somatic mutational landscape of splicing factor genes and their functional consequences across 33 cancer types. Cell Reports. 2018;23(1):282-296 e4
  53. Koh CM et al. MYC regulates the core pre-mRNA splicing machinery as an essential step in lymphomagenesis. Nature. 2015;523(7558):96-100
  54. Kahles A et al. Comprehensive analysis of alternative splicing across tumors from 8,705 patients. Cancer Cell. 2018;34(2):211-224 e6
  55. Ren X et al. Pervasive Intronic Polyadenylation Serves as a Potential Source of Cancer Neoantigens. 2022. DOI: 10.21203/rs.3.rs-1537870/v1
  56. Li Z et al. An isoform-resolution transcriptomic atlas of colorectal cancer from long-read single-cell sequencing. bioRxiv. 2023:2023.04.21.536771
  57. Wang ET et al. Alternative isoform regulation in human tissue transcriptomes. Nature. 2008;456(7221):470-476
  58. Merkin J et al. Evolutionary dynamics of gene and isoform regulation in mammalian tissues. Science. 2012;338(6114):1593-1599
  59. Barbosa-Morais NL et al. The evolutionary landscape of alternative splicing in vertebrate species. Science. 2012;338(6114):1587-1593
  60. Mazin PV et al. Alternative splicing during mammalian organ development. Nature Genetics. 2021;53(6):925-934
  61. Zhao Z et al. Cancer-associated dynamics and potential regulators of intronic polyadenylation revealed by IPAFinder using standard RNA-seq data. Genome Research. 2021;31(11):2095-2106
  62. Salamov AA, Solovyev VV. Recognition of 3′-processing sites of human mRNA precursors. Bioinformatics. 1997;13(1):23-28
  63. Shiraki T et al. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proceedings of the National Academy of Sciences. 2003;100(26):15776-15781
  64. Barth TK, Imhof A. Fast signals and slow marks: The dynamics of histone modifications. Trends in Biochemical Sciences. 2010;35(11):618-626
  65. The FANTOM Consortium and the RIKEN PMI and CLST (DGT). A promoter-level mammalian expression atlas. Nature. 2014;507(7493):462-470
  66. Zhou J et al. Annotating TSSs in multiple cell types based on DNA sequence and RNA-seq data via DeeReCT-TSS. Genomics, Proteomics & Bioinformatics. 2022;20(5):959-973
  67. Hoque M et al. Analysis of alternative cleavage and polyadenylation by 3′ region extraction and deep sequencing. Nature Methods. 2013;10(2):133-139
  68. Jan CH et al. Formation, regulation and evolution of Caenorhabditis elegans 3′ UTRs. Nature. 2011;469(7328):97-101
  69. Martin G et al. Genome-wide analysis of pre-mRNA 3′ end processing reveals a decisive role of human cleavage factor I in the regulation of 3′ UTR length. Cell Reports. 2012;1(6):753-763
  70. Derti A et al. A quantitative atlas of polyadenylation in five mammals. Genome Research. 2012;22(6):1173-1183
  71. Wang R et al. PolyA_DB 3 catalogs cleavage and polyadenylation sites identified by deep sequencing in multiple genomes. Nucleic Acids Research. 2018;46(D1):D315-D319
  72. Herrmann CJ et al. PolyASite 2.0: A consolidated atlas of polyadenylation sites from 3′ end sequencing. Nucleic Acids Research. 2020;48(D1):D174-D179
  73. Arefeen A et al. TAPAS: Tool for alternative polyadenylation site analysis. Bioinformatics. 2018;34(15):2521-2529
  74. Ha KC, Blencowe BJ, Morris Q. QAPA: A new method for the systematic analysis of alternative polyadenylation from RNA-seq data. Genome Biology. 2018;19:1-18
  75. Chang J-W et al. mRNA 3′-UTR shortening is a molecular signature of mTORC1 activation. Nature Communications. 2015;6(1):7218
  76. Ye C et al. APAtrap: Identification and quantification of alternative polyadenylation sites from RNA-seq data. Bioinformatics. 2018;34(11):1841-1849
  77. Li L et al. An atlas of alternative polyadenylation quantitative trait loci contributing to complex trait and disease heritability. Nature Genetics. 2021;53(7):994-1005
  78. Lusk R et al. Aptardi predicts polyadenylation sites in sample-specific transcriptomes using high-throughput RNA sequencing and DNA sequence. Nature Communications. 2021;12(1):1652
  79. Tabaska JE, Zhang MQ. Detection of polyadenylation signals in human DNA sequences. Gene. 1999;231(1-2):77-86
  80. Liu H et al. An in-silico method for prediction of polyadenylation signals in human sequences. Genome Informatics. 2003;14:84-93
  81. Cheng Y, Miura RM, Tian B. Prediction of mRNA polyadenylation sites by support vector machine. Bioinformatics. 2006;22(19):2320-2325
  82. Xie B et al. Poly (A) motif prediction using spectral latent features from human DNA sequences. Bioinformatics. 2013;29(13):i316-i325
  83. Xia Z et al. DeeReCT-PolyA: A robust and generic deep learning method for PAS identification. Bioinformatics. 2019;35(14):2371-2379
  84. Yu H, Dai Z. SANPolyA: A deep learning method for identifying Poly (A) signals. Bioinformatics. 2020;36(8):2393-2400
  85. Li Z et al. DeeReCT-APA: Prediction of alternative polyadenylation site usage through deep learning. Genomics, Proteomics & Bioinformatics. 2022;20(3):483-495
  86. Stroup EK, Ji Z. Deep learning of human polyadenylation sites at nucleotide resolution reveals molecular determinants of site usage and relevance in disease. Nature Communications. 2023;14(1):7378
  87. Desmet F-O et al. Human splicing finder: An online bioinformatics tool to predict splicing signals. Nucleic Acids Research. 2009;37(9):e67-e67
  88. Barash Y et al. Deciphering the splicing code. Nature. 2010;465(7294):53-59
  89. Cereda M et al. RNAmotifs: Prediction of multivalent RNA motifs that control alternative splicing. Genome Biology. 2014;15:1-12
  90. Dror G, Sorek R, Shamir R. Accurate identification of alternatively spliced exons using support vector machine. Bioinformatics. 2005;21(7):897-901
  91. Jian X, Boerwinkle E, Liu X. In silico prediction of splice-altering single nucleotide variants in the human genome. Nucleic Acids Research. 2014;42(22):13534-13544
  92. Mort M et al. MutPred splice: Machine learning-based prediction of exonic variants that disrupt splicing. Genome Biology. 2014;15:1-20
  93. Jaganathan K et al. Predicting splicing from primary sequence with deep learning. Cell. 2019;176(3):535-548 e24
  94. Cheng J et al. MMSplice: Modular modeling improves the predictions of genetic variant effects on splicing. Genome Biology. 2019;20:1-15
  95. Rosenberg AB et al. Learning the sequence determinants of alternative splicing from millions of random sequences. Cell. 2015;163(3):698-711
  96. Zeng T, Li YI. Predicting RNA splicing from DNA sequence using pangolin. Genome Biology. 2022;23(1):1-18
  97. Strauch Y et al. CI-SpliceAI—Improving machine learning predictions of disease causing splicing variants using curated alternative splice sites. PLoS One. 2022;17(6):e0269159
  98. Baeza-Centurion P et al. Combinatorial genetics reveals a scaling law for the effects of mutations on splicing. Cell. 2019;176(3):549-563 e23
  99. Wagner N et al. Aberrant splicing prediction across human tissues. Nature Genetics. 2023;55(5):861-870
  100. Rentzsch P et al. CADD-splice—Improving genome-wide variant effect prediction using deep learning-derived splice scores. Genome Medicine. 2021;13(1):1-12
  101. Avsec Ž et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nature Methods. 2021;18(10):1196-1203
  102. Ji Y et al. DNABERT: Pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics. 2021;37(15):2112-2120
  103. Nguyen E et al. HyenaDNA: Long-range genomic sequence modeling at single nucleotide resolution. Advances in Neural Information Processing Systems. 2024;36
  104. Glinos DA et al. Transcriptome variation in human tissues revealed by long-read sequencing. Nature. 2022;608(7922):353-359
