Abstract
The completion of the draft and, more recently, the complete human genome sequence has revealed that there are only around 20,000 protein-coding genes. Nonetheless, these genes generate roughly eight times as many RNA transcript isoforms, and this number is still growing with the accumulation of high-throughput RNA sequencing (RNA-seq) data. In general, over 90% of genes generate multiple RNA isoforms arising from variations at the 5′ and 3′ ends, as well as from different exon combinations, known as alternative transcription start site (TSS) usage, alternative polyadenylation (APA), and alternative splicing (AS), respectively. In this chapter, we introduce the significance of these three types of isoform variations in gene regulation and their underlying molecular mechanisms. Additionally, we highlight the historical, current, and prospective technological advances in elucidating isoform regulation, on both the computational side, such as deep-learning-based artificial intelligence, and the experimental side, such as long-read third-generation sequencing (TGS).
Keywords
- gene regulation
- RNA isoform
- RNA-seq
- next generation sequencing
- third generation sequencing
- transcription start site
- alternative splicing
- alternative polyadenylation
- deep learning
1. Introduction
Since the initial release of the human genome draft in 2001 [1, 2], gaps and unplaced sequences in the genome have been resolved continuously. In 2022, the telomere-to-telomere (T2T) consortium published the first complete sequence of a human genome [3]. With these genome sequences as the reference, genes have been annotated accordingly, and it is now well accepted that there are only around 20,000 protein-coding genes in the human genome. For instance, based on the GENCODE annotation database [4], the number of protein-coding genes (PCGs) has remained almost constant over the last decade (Figure 1A). However, the number of annotated RNA transcripts transcribed from these genes has gradually increased (Figure 1B). In general, over 90% of these genes are transcribed into multiple RNA transcripts known as isoforms, with variations at the 5′ and 3′ ends, as well as different exon combinations (Figure 1C).
Variations at the 5′ and 3′ ends of isoforms are known as alternative transcription start site (TSS) usage and alternative polyadenylation (APA), respectively, whereas isoform variations formed by different exon combinations are mediated by alternative splicing (AS). The three types of variations occur at different stages of RNA processing. TSS selection happens when transcription is initiated, and RNA splicing is a posttranscriptional or co-transcriptional process [5]. During transcription termination, nascent RNA molecules undergo cleavage and the addition of poly-adenosine (poly-A) tails, known as polyadenylation [6]. APA involves the use of different polyadenylation sites (PASs) to generate isoforms with distinct 3′ ends. In Section 2, we will introduce the molecular mechanisms underlying these three types of variations.
Variable RNA isoforms not only enable the limited number of genes to generate a much larger number of proteins but also greatly increase the complexity of gene regulation, even when variations only affect the noncoding sequences of RNA. Dysregulated isoform variations contribute significantly to pathogenesis because they impair tightly controlled gene regulation. For instance, approximately 15–50% of human genetic disorders are caused by mutations impairing RNA splicing [7]. In cancer, mutations impairing splicing are also frequent, resulting in widespread dysregulated splicing events. Moreover, cancer cells express isoforms that use distinct PASs. In addition, pervasive regulation of TSS usage through alternative promoters has been observed in tumor samples [8]. These dysregulations affect both the coding sequence (CDS) and the noncoding sequences of RNA, such as untranslated regions (UTRs) and introns. The functional consequences of isoform variations at the molecular and cellular levels will be described in Section 3, and their dysregulation in human disease will be introduced in Section 4.
In the last two decades, next-generation sequencing (NGS) has revolutionized RNA profiling of different cell types and of cells under different conditions [9]. While the primary goal of such profiling is to quantify expression at the gene level, many efforts have focused on isoform variations, mainly on alternative splicing, since its quantification is easier than that of alternative TSS usage and APA. To address this difficulty, specialized experimental protocols have been designed to capture TSSs or PASs more effectively. Computational methods have also been developed for detecting them from conventional RNA sequencing (RNA-seq) data, although their performance remains far from satisfactory.
In the last section of this chapter, we will list the influential experimental and computational methods for identifying and quantifying the three types of isoform variations. In addition, with the technological advancement of machine learning, numerous methods and models have been proposed for predicting alternative TSS usage, splicing, and polyadenylation using DNA/RNA sequences as inputs. Their performance has continuously improved, and many tools have become promising for evaluating the impact of genetic variants on isoform variations and for further predicting disease risk. However, all these methods are limited to variations at the event level and overlook the full-length sequence of isoforms. Consequently, at the end of this chapter, we will highlight the power of third-generation sequencing (TGS) in detecting full-length isoform variations and its potential for training deep-learning models centered on native, intact RNA molecules.
2. Molecular mechanisms underlying RNA isoform variations
Gene regulation usually relies on the binding or recruitment of trans-acting factors to cis-elements residing in DNA or RNA. The regulation of TSS usage, polyadenylation, and splicing will be introduced separately. At the end of this section, we will highlight the links among these three types of regulation.
2.1 TSS regulation
The TSS is determined by the assembly of a transcription initiation complex, comprising transcriptional machinery such as RNA polymerase and general transcription factors (GTFs), at core promoters [10]. Promoters share various common sequence features, the most well-documented motif being the TATA-box (TATAWAW), first identified in 1978 [11]. Furthermore, the assembly process is regulated by transcription factors (TFs) and their cofactors that bind to enhancers. Recent studies revealed that these factors form condensates, liquid-phase-like membrane-less organelles, to coordinate transcription initiation and elongation [12]. Different TFs have distinct sequence binding preferences in DNA, thereby shaping the varied landscape of activated promoters across different cell types and tissues. In addition, whether a promoter is used for transcribing RNA is also controlled by epigenetic information, such as nucleosome-free positions, DNA methylation, and posttranslational modifications of histones. Overall, the availability of different TFs and heterogeneous promoter epigenetic modifications together regulate alternative TSS utilization (Figure 2A).
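To make the TATAWAW consensus concrete, the following minimal Python sketch scans a DNA string for matches, expanding the IUPAC code W to A or T. The function name `find_tata_boxes` and the promoter fragment are hypothetical and invented for illustration; real promoter annotation relies on position weight matrices and experimental evidence rather than a plain pattern match.

```python
import re

# IUPAC code W stands for A or T, so TATAWAW expands to TATA[AT]A[AT].
# The lookahead lets overlapping matches be reported as well.
TATA_BOX = re.compile(r"(?=(TATA[AT]A[AT]))")


def find_tata_boxes(seq: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched 7-mer) for every TATAWAW hit."""
    seq = seq.upper()
    return [(m.start(), m.group(1)) for m in TATA_BOX.finditer(seq)]


# Hypothetical promoter fragment, made up for this example.
promoter = "GGCGCTATAAAAGGCTGCGTATATATGC"
hits = find_tata_boxes(promoter)
for start, motif in hits:
    print(start, motif)
```

A match only indicates that the consensus is present; whether the site is used depends on the TF availability and epigenetic context described above.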
2.2 Polyadenylation and alternative polyadenylation (APA)
On the other hand, polyadenylation is mainly regulated by several groups of RNA-binding proteins (RBPs), including cleavage and polyadenylation specificity factors (CPSF), cleavage stimulation factors (CstF), and so on (Figure 2B). During transcription termination, RNA polymerase pauses downstream of the cleavage/polyadenylation site (PAS), thereafter CPSF and CstF complex are recruited
2.3 Splicing and alternative splicing
Splicing regulation is even more sophisticated. The spliceosome, the largest protein complex in human cells, which also contains small nuclear RNAs (snRNAs), recognizes donor and acceptor splice sites (SSs) in RNA to define exons [17]. The core components of the spliceosome are small nuclear ribonucleoproteins (snRNPs), each consisting of an snRNA and Sm or Sm-like (LSm) proteins. The U1, U2, U4, U5, and U6 snRNPs regulate >99% of splicing events, while their variant, the so-called minor spliceosome formed by the U11, U12, U4atac, U6atac, and U5 snRNPs, regulates the rest [18, 19]. The difference in snRNPs between the two spliceosomes corresponds to two types of introns. The major or U2-type introns start with an almost invariable GU dinucleotide at the 5′ splice site (SS) and end with an AG dinucleotide at the 3′ SS, whereas U12-type introns start with GU or AU and end with AC or AG [20]. Classically, the 9 nt sequence (−3 to +6) around the 5′ SS and the 23 nt sequence (−20 to +3) around the 3′ SS are used for calculating splice site strength [21].
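The window definitions above can be illustrated with a short Python sketch that extracts the 9 nt donor window (3 exonic plus 6 intronic bases) and the 23 nt acceptor window (20 intronic plus 3 exonic bases) from a sequence. The function names, the coordinate convention (0-based index of the first intronic base and of the first base of the downstream exon), and the toy sequences are assumptions made for illustration. Scoring the windows with a maximum entropy model [21] requires trained parameters and is not attempted here. Sequences are written in the DNA alphabet, so GU appears as GT.

```python
def donor_window(seq: str, intron_start: int) -> str:
    """9-mer around a 5' splice site: positions -3..-1 (exon) and +1..+6 (intron).

    intron_start is the 0-based index of the first intronic base.
    """
    if intron_start < 3 or intron_start + 6 > len(seq):
        raise ValueError("not enough flanking sequence for the donor window")
    return seq[intron_start - 3:intron_start + 6]


def acceptor_window(seq: str, exon_start: int) -> str:
    """23-mer around a 3' splice site: positions -20..-1 (intron) and +1..+3 (exon).

    exon_start is the 0-based index of the first base of the downstream exon.
    """
    if exon_start < 20 or exon_start + 3 > len(seq):
        raise ValueError("not enough flanking sequence for the acceptor window")
    return seq[exon_start - 20:exon_start + 3]


# Toy pre-mRNA made up for illustration: exon 1, a U2-type (GT...AG) intron, exon 2.
exon1 = "ATGGCCAAG"
intron = "GTAAGTAC" + "TCTCTTTTCTTTTCCTTTCAG"
exon2 = "GTTCTGA"
seq = exon1 + intron + exon2

donor = donor_window(seq, len(exon1))
acceptor = acceptor_window(seq, len(exon1) + len(intron))
print(donor)     # donor 9-mer: 3 exonic bases followed by GT and 4 more intronic bases
print(acceptor)  # acceptor 23-mer: 20 intronic bases ending in AG, then 3 exonic bases
```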
During RNA maturation, introns are removed, and the two flanking exons are joined together. However, multiple SSs from the same gene can compete with each other, resulting in variable exon definitions and leading to complicated forms of alternative splicing, such as skipping single or multiple exons (SE), retaining introns (RI), and utilizing non-canonical 5′ SS or 3′ SS (A5SS and A3SS) (Figure 2C). Even more surprisingly, studies revealed that downstream exons can be back-spliced to join upstream exons, which generates RNA circles or so-called circular RNA [22, 23, 24]. In addition to the core spliceosome, other trans-factors also contribute to splicing
2.4 Crosstalk among the three types of variations
Despite the distinct molecular mechanisms underlying alternative TSS usage, polyadenylation, and splicing, the regulation of these variations is not fully independent. Intuitively, utilizing an alternative TSS is associated with alternative splicing of the first exon, while alternative splicing of the last exon affects the choice of polyadenylation site. In addition, the PAS signal is present in almost every intron in the human genome, suggesting a balance between removing an intron by splicing and terminating transcription by polyadenylation. Indeed, studies reveal that U1 snRNP protects pre-mRNA from drastic premature termination at PASs in introns [25], and that U1 motifs and PAS signals together shape the landscape of promoter directionality [26]. Interestingly, a recent study reveals that the choice of 3′ ends for polyadenylation is globally influenced by the selection of TSS [27], suggesting coupling among the regulation of the three types of variations.
3. Functional consequence of RNA isoform variations
In general, isoform variations have three kinds of consequences, including (a) altering open reading frame (ORF) including truncation, (b) changing noncoding sequence of RNA such as untranslated regions (UTR), and (c) triggering mRNA degradation
While ORF alterations can drastically impact functions of the encoded protein, noncoding sequences changes mainly regulate the final protein production. For instance, varied 5′ UTR affect mRNA translation efficiency [29], while 3′ UTR variations regulate mRNA stability and translation efficiency [30]. mRNA degradation
3.1 Isoform variations regulate biological processes at the cellular level
Since isoform variations or switches have broad impacts on gene expression, they induce diverse functional consequences at the cellular level. An interesting example is cell stemness and differentiation, where splicing has been the most extensively investigated. For instance, a conserved AS event controls the inclusion or exclusion of a cassette exon in FOXP1 that encodes the protein domain determining DNA binding preference, thereby regulating embryonic stem cell pluripotency and reprogramming [33]. Besides, intron retention changes in a group of genes, including Lmnb1, regulate granulocyte differentiation
In addition to AS, APA also contributes significantly to cell fate commitment, as evidenced by the perturbation of NUDT21, an APA regulator controlling 3′ UTR lengthening or shortening for thousands of transcripts [35]. Beyond differentiation, AS and APA are both capable of regulating cell proliferation and cell cycle progression. An interesting phenomenon is that actively proliferating cells tend to express isoforms with shortened 3′ UTR
4. Dysregulated RNA isoform variations in disease
As RNA isoform variation acts as a critical layer of gene regulation that is tightly controlled across different cell types and under different conditions, its dysregulation can disrupt gene functions and lead to disease. In this section, we will mainly introduce isoform dysregulation in human genetic disorders and cancers. Genetic disorders caused by aberrant splicing can be classified into two classes: those affecting cis-elements responsible for appropriate splicing, and those disrupting trans-factors regulating splicing, such as components of the spliceosome. The former includes mutations directly disrupting splice sites, accounting for an estimated 15% of human genetic disorders [37], as well as mutations that affect other splicing-related cis-elements. In total, it has been proposed that 50% or even 60% of human diseases are caused by these cis-element mutations [7]. For mutations impairing trans-factors regulating splicing, an extensively studied example is spinal muscular atrophy (SMA). SMA, which affects around 1 in 4000 to 1 in 16,000 births worldwide [38], is caused by mutations in the SMN1 gene. SMN1 encodes SMN, a protein responsible for snRNP assembly, and thus its mutations are expected to impair splicing globally. Compared to splicing, far fewer diseases caused by dysregulated polyadenylation have been discovered. Still, studies have reported several mutations that disrupt PAS signals and cause hematological disorders [39].
Nevertheless, dysregulation of all three types of isoform variations has been observed in cancer. A study systematically quantifying promoter activities by analyzing 18,468 RNA-seq samples across 42 cancer types reveals widespread utilization of alternative promoters [8]. In line with the observation that actively proliferating cells preferentially express isoforms with shortened 3′ UTRs [36], 3′ UTR are globally shortened in cancer cells
In addition to alternative TSS usage and APA, splicing has been much more extensively studied in cancer at both the DNA level and the RNA level. Mutations disrupting canonical splice sites have been identified in various essential cancer-related genes, such as TP53 and BRCA1 [47], and this is a widespread mechanism for inactivating tumor suppressors [48]. Besides, recurrent somatic mutations in splicing factors have also been identified, with representative studies showing frequent mutations in the genes SF3B1, SRSF2, U2AF1, and ZRSR2 in hematopoietic malignancies, such as myelodysplastic syndromes (MDS) and leukemia [49, 50, 51]. A more recent study characterizing splicing factor mutations across 33 cancer types demonstrates that hotspot mutations of SF3B1 and U2AF1 are also frequent in multiple solid tumors [52]. Beyond mutations in splicing factors, MYC, a well-known proto-oncogene encoding a transcription factor, indirectly regulates splicing by promoting the expression of core spliceosome components as an essential step in lymphomagenesis [53]. These studies suggest that splicing dysregulation is potentially highly frequent in cancer. Indeed, a comprehensive analysis of AS based on RNA-seq samples from 8705 cancer patients reveals more active AS in tumors compared to normal samples, as well as hundreds of aberrant splicing events with novel exon-exon junctions that are not present in normal samples [54].
Taken together, these studies demonstrate the high frequency of dysregulated isoform variations in disease, with aberrant splicing being the most evident. Many of these dysregulations disrupt gene functions by isoform switching and contribute significantly to pathogenesis. On the other hand, they may also serve as therapeutic targets. For instance, an antisense oligonucleotide drug modulating AS of the SMN2 gene has been approved by the FDA for SMA treatment. Additionally, dysregulated isoform variations in cancer that induce novel ORFs, such as aberrant splicing and intronic polyadenylation (IPA), may serve as sources of neoantigens [55, 56], which are promising for cancer vaccine development.
5. Technology advancement in quantifying and predicting isoform variations
NGS is a technology capable of sequencing massive numbers of DNA fragments in parallel, up to hundreds of millions in one experiment. Applying NGS to complementary DNA (cDNA) libraries reverse transcribed from RNA, known as RNA-seq, has revolutionized gene expression profiling. In addition, RNA-seq is also efficient in detecting and quantifying alternative splicing events using reads (short fragments sequenced by NGS) that support splice junctions. An extensively used metric is the percent spliced in (PSI), calculated as the number of reads supporting inclusion divided by the sum of reads supporting inclusion and exclusion, which measures the proportion of isoforms that include the event. With this metric, global AS profiling has been performed across different tissues and species [57, 58, 59], as well as across organs at different developmental stages [60].
As detecting and quantifying AS events from RNA-seq data is relatively straightforward, computational methods for AS mainly focus on differential splicing analysis. rMATS and DEXSeq are two representative methods that have been widely utilized [61, 62]. However, accurate identification and quantification of alternative TSS usage and APA directly from RNA-seq data is more challenging than for AS. To this end, various experimental methods have been developed based on TSS and PAS properties. We will briefly list these experimental methods, together with the efforts of computational approaches toward accurate identification and quantification of TSS and PAS expression, in Section 5.1. On the other hand, distinct cis-elements determine the definition and utilization of TSS, PAS, and splice site, endeavors have been conducted to predict them
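The PSI definition above translates directly into code. The sketch below is a minimal illustration using raw junction read counts; the function name `psi` and the example counts are hypothetical. Dedicated tools typically add refinements such as length normalization and statistical modeling of replicates, which are omitted here.

```python
from typing import Optional


def psi(inclusion_reads: int, exclusion_reads: int) -> Optional[float]:
    """Percent spliced in: inclusion / (inclusion + exclusion).

    Returns None when no read supports either outcome, because the
    proportion is undefined without coverage.
    """
    if inclusion_reads < 0 or exclusion_reads < 0:
        raise ValueError("read counts must be non-negative")
    total = inclusion_reads + exclusion_reads
    if total == 0:
        return None
    return inclusion_reads / total


# Hypothetical counts for one cassette exon in two conditions.
psi_control = psi(inclusion_reads=30, exclusion_reads=10)
psi_treated = psi(inclusion_reads=5, exclusion_reads=15)
delta_psi = psi_treated - psi_control
print(psi_control, psi_treated, delta_psi)
```

The difference in PSI between conditions is the quantity that differential splicing analyses evaluate, with the caveat that events supported by few reads yield unstable estimates.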
5.1 Identifying and quantifying alternative TSS and PAS from NGS data
There are two types of high-throughput experimental approaches for comprehensive TSS profiling. As mature transcripts transcribed by RNA polymerase II have a specific cap structure at their 5′ ends, cap analysis of gene expression (CAGE) is a method capable of enriching the 5′ ends of RNA. CAGE followed by massively parallel sequencing (CAGE-seq) is a high-throughput approach for transcriptome-wide TSS profiling [63]. Additionally, active promoter regions in chromatin are enriched with several specific histone modification marks, including tri-methylation of histone 3 lysine 4 (H3K4me3) and acetylation of histone 3 lysine 9 (H3K9ac) and of histone 3 lysine 27 (H3K27ac). The second type of method infers TSS utilization from chromatin immunoprecipitation followed by sequencing (ChIP-seq) for these three marks, albeit at relatively lower resolution [64]. With CAGE-seq, the Functional ANnoTation of the Mammalian Genome (FANTOM) project identified 201,802 putative TSSs across dozens of cell lines [65]. Among them, 70% (143,200) are from genic regions, while only 40% of these (56,793 out of 143,200) are associated with annotated transcripts in GENCODE (Figure 4A and B), suggesting that the current transcript isoform annotation may still not be comprehensive. On the other hand, computational methods have also been developed to annotate and quantify TSSs from RNA-seq data, utilizing splice junctions spanning the first and second exons [8] or RNA-seq coverage together with sequence features [66].
For global PAS identification and APA quantification, dozens of experimental protocols have been developed. The majority of them are designed to enrich fragments of the transcript 3′ end close to or comprising the poly-A tail, such as 3P-seq, Aseq, PolyA-seq, and 3′ READS [67, 68, 69, 70]. Accordingly, two widely used databases, PolyASite and PolyA_DB, have been constructed using data from these experimental 3′ end sequencing (3′ end-seq) approaches [71, 72]. Of the annotated sites, 53% of PolyASite (v2) PASs and 64% of PolyA_DB (v3) PASs are from genic regions (Figure 4C). Even though both databases are derived from 3′ end-seq data, the overlapping sites account for only 39% of the sites in PolyASite and 66% of the sites in PolyA_DB, suggesting heterogeneity among different 3′ end-seq protocols. For instance, PolyA_DB uses data from 3′ READS, while PolyASite uses data from 3P-seq, Aseq, PolyA-seq, and others. Moreover, 86% of PolyASite PASs and 79% of PolyA_DB PASs are not associated with any annotated transcript in GENCODE, again indicating that the isoform annotation is far from complete (Figure 3D). In addition to experimental approaches, computational methods for directly identifying PASs and quantifying APA from RNA-seq data are also feasible, although their efficiency and accuracy are suboptimal. Almost all of them are designed to detect drop points in RNA-seq coverage along the gene body, and many of them are only able to identify and quantify APA within 3′ UTRs, such as TAPAS [73], QAPA [74], GETUTR [75], APAtrap [76], DaPars2 [77], and Aptardi [78]. Besides, IPAFinder is designed specifically for detecting APA within intronic regions from RNA-seq data [61]. More recently, we developed APAIQ, an accurate method capable of transcriptome-wide APA identification and quantification from RNA-seq data, showing much higher precision and recall than previous methods [14].
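The idea of detecting a drop point in RNA-seq coverage can be illustrated with a deliberately simple sketch: scan every candidate split of a per-base coverage vector and keep the one that maximizes the difference between the mean upstream and mean downstream coverage. This is a toy illustration of the general principle only and is not the algorithm of any tool named above. The function name `find_coverage_drop`, the `min_flank` parameter, and the coverage values are hypothetical, and the vector is assumed to be oriented 5′ to 3′ along the transcript.

```python
def find_coverage_drop(coverage: list[float], min_flank: int = 3) -> tuple[int, float, float]:
    """Locate the split with the largest drop in mean coverage.

    Returns (split_index, upstream_mean, downstream_mean), where positions
    [0, split_index) are upstream and [split_index, n) are downstream.
    Each side must contain at least min_flank positions.
    """
    n = len(coverage)
    if n < 2 * min_flank:
        raise ValueError("coverage vector too short for the requested flank size")

    # Prefix sums give the mean of any prefix or suffix in constant time.
    prefix = [0.0]
    for value in coverage:
        prefix.append(prefix[-1] + value)

    best = None
    for i in range(min_flank, n - min_flank + 1):
        upstream = prefix[i] / i
        downstream = (prefix[n] - prefix[i]) / (n - i)
        drop = upstream - downstream
        if best is None or drop > best[0]:
            best = (drop, i, upstream, downstream)

    _, split, upstream, downstream = best
    return split, upstream, downstream


# Hypothetical per-base coverage across a 3' UTR with a proximal PAS.
coverage = [50, 52, 49, 51, 50, 48, 10, 9, 11, 10, 12, 8]
split, up_mean, down_mean = find_coverage_drop(coverage)
print(split, up_mean, down_mean)
```

In this toy case, the ratio of downstream to upstream mean coverage gives a rough estimate of the fraction of transcripts that read through to a more distal site. Real data are far noisier, which is one reason the accuracy of coverage-based methods is limited.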
5.2 Predicting isoform variations with DNA/RNA sequences
Owing to the heterogeneous flexibility of the sequence features for TSSs, PASs, and splice sites, as well as their distinct pathogenic impacts, methods for predicting the three types of isoform variations from DNA/RNA sequence are at different developmental stages and have different focuses. To date, there are many computational methods for predicting TSSs, but none of them is capable of predicting the utilization of alternative TSSs within the same gene. Hence, in this section, we will only focus on alternative polyadenylation and splicing. We will briefly introduce the historical computational methods and highlight the recent advancements utilizing artificial intelligence (AI) techniques in predicting these two kinds of variations.
5.2.1 Predicting PAS and APA
The early methods for PAS prediction mainly aim to discriminate true PASs from pseudo-ones that also contain the PAS motif,
5.2.2 Predicting alternative splicing
Traditional computational methods for predicting splice events typically rely on motifs [87, 88, 89]. These methods assume the existence of characteristic sequences, or motifs, near splice acceptor and donor sites. Any mutations disrupting these motifs can consequently impact splicing events. While motif-based approaches offer intuitiveness and interpretability, they suffer from limitations in coverage and fail to fully capture the intricate regulatory mechanisms governing splicing. To overcome these limitations, some studies have integrated machine learning techniques, such as support vector machines and random forests, with splice event prediction [90, 91, 92]. While these traditional machine learning algorithms perform better than motif-based methods, they heavily rely on feature engineering, limiting their applicability and generalization.
SpliceAI pioneered the prediction of alternative splicing events through end-to-end deep learning models, utilizing gene sequences directly as inputs to estimate the probability of each position being an acceptor or donor site [93]. In comparison to approaches such as MMSplice [94] and HAL [95], SpliceAI substantially extends the input sequence length to over 10,000 bases, allowing it to consider a broader range of information surrounding splice sites, especially nearby regulatory elements. To process such long sequences effectively, a dilated convolutional neural network with residual connections is employed. SpliceAI has exhibited remarkable performance in both splice site identification and the prediction of mutation impacts on alternative splicing events. However, SpliceAI lacks the capacity to differentiate between variations across different tissues and organisms, hindering its generalizability.
Several subsequent studies have endeavored to enhance performance based on SpliceAI, particularly in predicting the effects of mutations on alternative splicing. These efforts involve the integration of data from multiple species and multiple tissues [96], curated alternative splice sites [97], scaling laws [98, 99], and predictions from other tools [100]. Another approach to predicting splice events using deep learning models treats gene sequences as analogous to natural language text and trains large language models on existing sequencing data to forecast splicing and mutation effects, as demonstrated by Enformer [101], DNABERT [102], and HyenaDNA [103]. These methods rely on transformer-based or related long-context architectures and large-scale sequencing data, which enable a significant extension of the model's receptive field. For instance, Enformer achieves a receptive field of 100 kb and has demonstrated state-of-the-art performance on multiple gene expression prediction tasks. It is important to note that because these methods are designed for general gene sequence prediction tasks, they may not always perform optimally in predicting splice events.
Despite great progress in methods for identifying and quantifying alternative TSS usage, splicing, and polyadenylation from RNA-seq data, as well as in deep-learning-based methods for predicting isoform variations from DNA/RNA sequence, they are all designed for detecting variations at the event level rather than variations across different intact RNA isoforms. TGS platforms from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), which enable high-throughput sequencing of DNA or RNA with long reads of up to 25 kb and 300 kb, respectively, have emerged as powerful tools for full-length RNA isoform profiling. In 2022, a study applied ONT sequencing to 88 samples from genotype-tissue expression (GTEx) tissues and cell lines, revealing significant coupling between multiple alternative splicing events across isoforms and identifying allele-specific utilization of isoforms [104]. This highlights the significance of utilizing full-length isoforms in assessing the outcome of genetic variants. Moreover, ~60% of TSSs captured by CAGE-seq and over 80% of PASs identified from 3′ end-seq data are not associated with any transcript in GENCODE, suggesting that the current isoform annotation is far from complete and pointing to an unmet need for isoform identification with TGS data. With the accumulation of full-length isoform profiling data and the advancement of AI techniques, we anticipate isoform-centric deep-learning models that encompass all types of variation events and predict outcomes at the level of the native, intact RNA molecule.
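To illustrate why dilated convolutions allow a network to cover roughly 10,000 bases with a modest number of layers, the following Python sketch computes the receptive field of a stack of stride-1 convolutions, each of which adds (kernel size − 1) × dilation positions. The layer configuration is an assumption based on our reading of the SpliceAI publication [93] (four groups of four residual blocks, two convolutions per block) and should be checked against the original source before being relied upon. The function name `receptive_field` is hypothetical.

```python
def receptive_field(layers: list[tuple[int, int]]) -> int:
    """Receptive field of stacked stride-1 convolutions.

    layers holds one (kernel_size, dilation) pair per convolution. Residual
    skip connections and 1x1 convolutions do not enlarge the receptive field,
    so they are ignored.
    """
    field = 1
    for kernel_size, dilation in layers:
        field += (kernel_size - 1) * dilation
    return field


# ASSUMED configuration (not taken from the text above): four groups of four
# residual blocks with (kernel_size, dilation) of (11, 1), (11, 4), (21, 10)
# and (41, 25), and two convolutions inside every residual block.
block_settings = [(11, 1)] * 4 + [(11, 4)] * 4 + [(21, 10)] * 4 + [(41, 25)] * 4
conv_layers = [setting for setting in block_settings for _ in range(2)]

total = receptive_field(conv_layers)
print(total)            # number of input positions seen by each output position
print((total - 1) // 2)  # flanking context on each side of the predicted position
```

Under these assumed settings, most of the context comes from the last group of blocks with the largest dilation, which shows how dilation rather than depth alone extends the range of sequence that the model can consider.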
References
1. Lander ES et al. Initial sequencing and analysis of the human genome. Nature. 2001;409(6822):860-921
2. Venter JC et al. The sequence of the human genome. Science. 2001;291(5507):1304-1351
3. Nurk S et al. The complete sequence of a human genome. Science. 2022;376(6588):44-53
4. Harrow J et al. GENCODE: The reference human genome annotation for the ENCODE project. Genome Research. 2012;22(9):1760-1774
5. Tilgner H et al. Deep sequencing of subcellular RNA fractions shows splicing to be predominantly co-transcriptional in the human genome but inefficient for lncRNAs. Genome Research. 2012;22(9):1616-1625
6. Proudfoot NJ. Transcriptional termination in mammals: Stopping the RNA polymerase II juggernaut. Science. 2016;352(6291):aad9926
7. Wang G-S, Cooper TA. Splicing in disease: Disruption of the splicing code and the decoding machinery. Nature Reviews Genetics. 2007;8(10):749-761
8. Demircioğlu D et al. A pan-cancer transcriptome analysis reveals pervasive regulation through alternative promoters. Cell. 2019;178(6):1465-1477.e17
9. Wang Z, Gerstein M, Snyder M. RNA-Seq: A revolutionary tool for transcriptomics. Nature Reviews Genetics. 2009;10(1):57-63
10. Haberle V, Stark A. Eukaryotic core promoters and the functional basis of transcription initiation. Nature Reviews Molecular Cell Biology. 2018;19(10):621-637
11. Lifton R et al. The organization of the histone genes in Drosophila melanogaster: Functional and evolutionary implications. In: Cold Spring Harbor Symposia on Quantitative Biology. Cold Spring Harbor Laboratory Press; 1978. DOI: 10.1101/SQB.1978.042.01.105
12. Cramer P. Organization and regulation of gene transcription. Nature. 2019;573(7772):45-54
13. Proudfoot N, Brownlee G. 3′ non-coding region sequences in eukaryotic messenger RNA. Nature. 1976;263(5574):211-214
14. Long Y et al. Accurate transcriptome-wide identification and quantification of alternative polyadenylation from RNA-seq data with APAIQ. Genome Research. 2023;33(4):644-657
15. Xiao MS et al. Global analysis of regulatory divergence in the evolution of mouse alternative polyadenylation. Molecular Systems Biology. 2016;12(12):890
16. Masamha CP et al. CFIm25 links alternative polyadenylation to glioblastoma tumour suppression. Nature. 2014;510(7505):412-416
17. Matera AG, Wang Z. A day in the life of the spliceosome. Nature Reviews Molecular Cell Biology. 2014;15(2):108-121
18. Turunen JJ et al. The significant other: Splicing by the minor spliceosome. Wiley Interdisciplinary Reviews: RNA. 2013;4(1):61-76
19. Will CL, Lührmann R. Spliceosome structure and function. Cold Spring Harbor Perspectives in Biology. 2011;3(7):a003707
20. Sheth N et al. Comprehensive splice-site analysis using comparative genomics. Nucleic Acids Research. 2006;34(14):3955-3967
21. Yeo G, Burge CB. Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. RECOMB '03: Proceedings of the Seventh Annual International Conference on Research in Computational Molecular Biology. 2003. DOI: 10.1145/640075.640118
22. Memczak S et al. Circular RNAs are a large class of animal RNAs with regulatory potency. Nature. 2013;495(7441):333-338
23. Salzman J et al. Circular RNAs are the predominant transcript isoform from hundreds of human genes in diverse cell types. PLoS One. 2012;7(2):e30733
24. Hansen TB et al. Natural RNA circles function as efficient microRNA sponges. Nature. 2013;495(7441):384-388
25. Berg MG et al. U1 snRNP determines mRNA length and regulates isoform expression. Cell. 2012;150(1):53-64
26. Almada AE et al. Promoter directionality is controlled by U1 snRNP and polyadenylation signals. Nature. 2013;499(7458):360-363
27. Alfonso-Gonzalez C et al. Sites of transcription initiation drive mRNA isoform selection. Cell. 2023;186(11):2438-2455.e22
28. Chan JJ et al. Pan-cancer pervasive upregulation of 3′ UTR splicing drives tumourigenesis. Nature Cell Biology. 2022;24(6):928-939
29. Leppek K, Das R, Barna M. Functional 5′ UTR mRNA structures in eukaryotic translation regulation and how to find them. Nature Reviews Molecular Cell Biology. 2018;19(3):158-174
30. Mayr C. What are 3′ UTRs doing? Cold Spring Harbor Perspectives in Biology. 2019;11(10):a034728
31. Braunschweig U et al. Widespread intron retention in mammals functionally tunes transcriptomes. Genome Research. 2014;24(11):1774-1786
32. Yan Q et al. Systematic discovery of regulated and conserved alternative exons in the mammalian brain reveals NMD modulating chromatin regulators. Proceedings of the National Academy of Sciences. 2015;112(11):3445-3450
33. Gabut M et al. An alternative splicing switch regulates embryonic stem cell pluripotency and reprogramming. Cell. 2011;147(1):132-146
34. Wong JJ-L et al. Orchestrated intron retention regulates normal granulocyte differentiation. Cell. 2013;154(3):583-595
35. Brumbaugh J et al. Nudt21 controls cell fate by connecting alternative polyadenylation to chromatin signaling. Cell. 2018;172(1):106-120.e21
36. Sandberg R et al. Proliferating cells express mRNAs with shortened 3′ untranslated regions and fewer microRNA target sites. Science. 2008;320(5883):1643-1647
37. Krawczak M, Reiss J, Cooper DN. The mutational spectrum of single base-pair substitutions in mRNA splice junctions of human genes: Causes and consequences. Human Genetics. 1992;90:41-54
38. Verhaart IE et al. A multi-source approach to determine SMA incidence and research ready population. Journal of Neurology. 2017;264:1465-1473
39. Gruber AJ, Zavolan M. Alternative cleavage and polyadenylation in health and disease. Nature Reviews Genetics. 2019;20(10):599-614
40. Xia Z et al. Dynamic analyses of alternative polyadenylation from RNA-seq reveal a 3′-UTR landscape across seven tumour types. Nature Communications. 2014;5(1):5274
41. Mayr C, Bartel DP. Widespread shortening of 3′ UTRs by alternative cleavage and polyadenylation activates oncogenes in cancer cells. Cell. 2009;138(4):673-684
42. Lee S-H et al. Widespread intronic polyadenylation inactivates tumour suppressor genes in leukaemia. Nature. 2018;561(7721):127-131
43. Tan S et al. NUDT21 negatively regulates PSMB2 and CXXC5 by alternative polyadenylation and contributes to hepatocellular carcinoma suppression. Oncogene. 2018;37(35):4887-4900
44. Huang J et al. Suppression of cleavage factor Im 25 promotes the proliferation of lung cancer cells through alternative polyadenylation. Biochemical and Biophysical Research Communications. 2018;503(2):856-862
45. Xiong M et al. NUDT21 inhibits bladder cancer progression through ANXA2 and LIMK2 by alternative polyadenylation. Theranostics. 2019;9(24):7156
46. Xing Y et al. Downregulation of NUDT21 contributes to cervical cancer progression through alternative polyadenylation. Oncogene. 2021;40(11):2051-2064
47. Jayasinghe RG et al. Systematic analysis of splice-site-creating mutations in cancer. Cell Reports. 2018;23(1):270-281.e3
48. Jung H et al. Intron retention is a widespread mechanism of tumor-suppressor inactivation. Nature Genetics. 2015;47(11):1242-1248
Quesada V et al. Exome sequencing identifies recurrent mutations of the splicing factor SF3B1 gene in chronic lymphocytic leukemia. Nature Genetics. 2012; 44 (1):47-52 - 50.
Yoshida K et al. Frequent pathway mutations of splicing machinery in myelodysplasia. Nature. 2011; 478 (7367):64-69 - 51.
Graubert TA et al. Recurrent mutations in the U2AF1 splicing factor in myelodysplastic syndromes. Nature Genetics. 2012; 44 (1):53-57 - 52.
Seiler M et al. Somatic mutational landscape of splicing factor genes and their functional consequences across 33 cancer types. Cell Reports. 2018; 23 (1):282-296 e4 - 53.
Koh CM et al. MYC regulates the core pre-mRNA splicing machinery as an essential step in lymphomagenesis. Nature. 2015; 523 (7558):96-100 - 54.
Kahles A et al. Comprehensive analysis of alternative splicing across tumors from 8,705 patients. Cancer Cell. 2018; 34 (2):211-224 e6 - 55.
Ren X et al. Pervasive Intronic Polyadenylation Serves as a Potential Source of Cancer Neoantigens. 2022. DOI: 10.21203/rs.3.rs-1537870/v1 - 56.
Li Z et al. An isoform-resolution transcriptomic atlas of colorectal cancer from long-read single-cell sequencing. bioRxiv. 2023 04.21.536771 - 57.
Wang ET et al. Alternative isoform regulation in human tissue transcriptomes. Nature. 2008; 456 (7221):470-476 - 58.
Merkin J et al. Evolutionary dynamics of gene and isoform regulation in mammalian tissues. Science. 2012; 338 (6114):1593-1599 - 59.
Barbosa-Morais NL et al. The evolutionary landscape of alternative splicing in vertebrate species. Science. 2012; 338 (6114):1587-1593 - 60.
Mazin PV et al. Alternative splicing during mammalian organ development. Nature Genetics. 2021; 53 (6):925-934 - 61.
Zhao Z et al. Cancer-associated dynamics and potential regulators of intronic polyadenylation revealed by IPAFinder using standard RNA-seq data. Genome Research. 2021; 31 (11):2095-2106 - 62.
Salamov AA, Solovyev VV. Recognition of 3′-processing sites of human mRNA precursors. Bioinformatics. 1997; 13 (1):23-28 - 63.
Shiraki T et al. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proceedings of the National Academy of Sciences. 2003; 100 (26):15776-15781 - 64.
Barth TK, Imhof A. Fast signals and slow marks: The dynamics of histone modifications. Trends in Biochemical Sciences. 2010; 35 (11):618-626 - 65.
The FANTOM Consortium and the RIKEN PMI and CLST (DGT). A promoter-level mammalian expression atlas. Nature. 2014; 507 (7493):462-470 - 66.
Zhou J et al. Annotating TSSs in multiple cell types based on DNA sequence and RNA-seq data via DeeReCT-TSS. Genomics, Proteomics & Bioinformatics. 2022; 20 (5):959-973 - 67.
Hoque M et al. Analysis of alternative cleavage and polyadenylation by 3′ region extraction and deep sequencing. Nature Methods. 2013; 10 (2):133-139 - 68.
Jan CH et al. Formation, regulation and evolution of Caenorhabditis elegans 3′ UTRs. Nature. 2011; 469 (7328):97-101 - 69.
Martin G et al. Genome-wide analysis of pre-mRNA 3′ end processing reveals a decisive role of human cleavage factor I in the regulation of 3′ UTR length. Cell Reports. 2012; 1 (6):753-763 - 70.
Derti A et al. A quantitative atlas of polyadenylation in five mammals. Genome Research. 2012; 22 (6):1173-1183 - 71.
Wang R et al. PolyA_DB 3 catalogs cleavage and polyadenylation sites identified by deep sequencing in multiple genomes. Nucleic Acids Research. 2018; 46 (D1):D315-D319 - 72.
Herrmann CJ et al. PolyASite 2.0: A consolidated atlas of polyadenylation sites from 3′ end sequencing. Nucleic Acids Research. 2020; 48 (D1):D174-D179 - 73.
Arefeen A et al. TAPAS: Tool for alternative polyadenylation site analysis. Bioinformatics. 2018; 34 (15):2521-2529 - 74.
Ha KC, Blencowe BJ, Morris Q. QAPA: A new method for the systematic analysis of alternative polyadenylation from RNA-seq data. Genome Biology. 2018; 19 :1-18 - 75.
Chang J-W et al. mRNA 3′-UTR shortening is a molecular signature of mTORC1 activation. Nature Communications. 2015; 6 (1):7218 - 76.
Ye C et al. APAtrap: Identification and quantification of alternative polyadenylation sites from RNA-seq data. Bioinformatics. 2018; 34 (11):1841-1849 - 77.
Li L et al. An atlas of alternative polyadenylation quantitative trait loci contributing to complex trait and disease heritability. Nature Genetics. 2021; 53 (7):994-1005 - 78.
Lusk R et al. Aptardi predicts polyadenylation sites in sample-specific transcriptomes using high-throughput RNA sequencing and DNA sequence. Nature Communications. 2021; 12 (1):1652 - 79.
Tabaska JE, Zhang MQ. Detection of polyadenylation signals in human DNA sequences. Gene. 1999; 231 (1-2):77-86 - 80.
Liu H et al. An in-silico method for prediction of polyadenylation signals in human sequences. Genome Informatics. 2003; 14 :84-93 - 81.
Cheng Y, Miura RM, Tian B. Prediction of mRNA polyadenylation sites by support vector machine. Bioinformatics. 2006; 22 (19):2320-2325 - 82.
Xie B et al. Poly (A) motif prediction using spectral latent features from human DNA sequences. Bioinformatics. 2013; 29 (13):i316-i325 - 83.
Xia Z et al. DeeReCT-PolyA: A robust and generic deep learning method for PAS identification. Bioinformatics. 2019; 35 (14):2371-2379 - 84.
Yu H, Dai Z. SANPolyA: A deep learning method for identifying Poly (A) signals. Bioinformatics. 2020; 36 (8):2393-2400 - 85.
Li Z et al. DeeReCT-APA: Prediction of alternative polyadenylation site usage through deep learning. Genomics, Proteomics & Bioinformatics. 2022; 20 (3):483-495 - 86.
Stroup EK, Ji Z. Deep learning of human polyadenylation sites at nucleotide resolution reveals molecular determinants of site usage and relevance in disease. Nature Communications. 2023; 14 (1):7378 - 87.
Desmet F-O et al. Human splicing finder: An online bioinformatics tool to predict splicing signals. Nucleic Acids Research. 2009; 37 (9):e67-e67 - 88.
Barash Y et al. Deciphering the splicing code. Nature. 2010; 465 (7294):53-59 - 89.
Cereda M et al. RNAmotifs: Prediction of multivalent RNA motifs that control alternative splicing. Genome Biology. 2014; 15 :1-12 - 90.
Dror G, Sorek R, Shamir R. Accurate identification of alternatively spliced exons using support vector machine. Bioinformatics. 2005; 21 (7):897-901 - 91.
Jian X, Boerwinkle E, Liu X. In silico prediction of splice-altering single nucleotide variants in the human genome. Nucleic Acids Research. 2014; 42 (22):13534-13544 - 92.
Mort M et al. MutPred splice: Machine learning-based prediction of exonic variants that disrupt splicing. Genome Biology. 2014; 15 :1-20 - 93.
Jaganathan K et al. Predicting splicing from primary sequence with deep learning. Cell. 2019; 176 (3):535-548 e24 - 94.
Cheng J et al. MMSplice: Modular modeling improves the predictions of genetic variant effects on splicing. Genome Biology. 2019; 20 :1-15 - 95.
Rosenberg AB et al. Learning the sequence determinants of alternative splicing from millions of random sequences. Cell. 2015; 163 (3):698-711 - 96.
Zeng T, Li YI. Predicting RNA splicing from DNA sequence using pangolin. Genome Biology. 2022; 23 (1):1-18 - 97.
Strauch Y et al. CI-SpliceAI—Improving machine learning predictions of disease causing splicing variants using curated alternative splice sites. PLoS One. 2022; 17 (6):e0269159 - 98.
Baeza-Centurion P et al. Combinatorial genetics reveals a scaling law for the effects of mutations on splicing. Cell. 2019; 176 (3):549-563 e23 - 99.
Wagner N et al. Aberrant splicing prediction across human tissues. Nature Genetics. 2023; 55 (5):861-870 - 100.
Rentzsch P et al. CADD-splice—Improving genome-wide variant effect prediction using deep learning-derived splice scores. Genome Medicine. 2021; 13 (1):1-12 - 101.
Avsec Ž et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nature Methods. 2021; 18 (10):1196-1203 - 102.
Ji Y et al. DNABERT: Pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics. 2021; 37 (15):2112-2120 - 103.
Nguyen E et al. Hyenadna: Long-range genomic sequence modeling at single nucleotide resolution. Advances in Neural Information Processing Systems. 2024; 36 - 104.
Glinos DA et al. Transcriptome variation in human tissues revealed by long-read sequencing. Nature. 2022; 608 (7922):353-359