
Automating Research Problem Framing and Exploration through Knowledge Extraction from Bibliometric Data

Written By

Christian-Daniel Curiac, Mihai Micea, Traian-Radu Plosca, Daniel-Ioan Curiac, Simona Doboli and Alex Doboli

Submitted: 31 January 2024 Reviewed: 13 February 2024 Published: 10 June 2024

DOI: 10.5772/intechopen.1005575


From the Edited Volume

Bibliometrics - An Essential Methodological Tool for Research Projects [Working Title]

Dr. Otavio Oliveira


Abstract

The steep rate at which the number of research outputs has been growing (e.g., books, journal articles, conference proceedings, patents, and other work in digital format) produces not only many intriguing opportunities but also significant challenges. Efficiently managing the research outputs stored in very large digital databases has become much more difficult and error-prone than before. It is hard to precisely track all published documents in a way that is usable by humans to frame and explore new research problems. Methodologies and software tools are necessary to automate the time- and resource-consuming activities in research. This chapter overviews the existing work and envisioned opportunities to automate the analysis of the research outputs available in digital databases to maximize the research quality and impact. A novel system architecture is also suggested to support research problem framing and exploration. The architecture includes smart recommender modules that also address other research activities, like researcher and institution assessment, bibliography recommendation, and research team formation.

Keywords

  • bibliometric data
  • research trend analysis
  • research problem framing
  • research team formation
  • publication latency

1. Introduction

The needs of modern economies and societies have continuously increased the importance of research work in science, engineering, humanities, and social sciences [1, 2, 3]. Research has uncovered new fundamental knowledge and formulated novel theories about the physical world; has gained new insights into living organisms and produced new treatments in the medical fields; has devised new engineering solutions that are faster, cheaper, more robust, and less energy-hungry; and has produced new studies and products that express human nature in both individual and social settings. The broad set of topics tackled by research endeavors has involved an increasing number of researchers and required an increasing amount of resources allocated by societies [4]. Often, scientific opportunities and accomplishments capture the attention of the broad public across all demographics, bringing research to the attention of the media and other news-oriented platforms [5, 6, 7]. This situation not only creates unique opportunities and gives research large visibility but also produces increasing expectations and responsibilities for the outcomes of research work.

The unprecedented pace at which the volume of research outputs has grown produces not only opportunities but also challenges that must be addressed. Research outputs in digital format include books, book chapters, journal articles, conference proceedings, patents, student theses, technical reports and documents, blogs, web pages, podcasts, design descriptions, computer code, and many more [8, 9, 10]. Ubiquitous, fast, and often free access to research documents in digital format supports a quick dissemination of results. Democratizing the access to scientific outcomes has enabled interested professionals to stay updated with respect to the state-of-the-art knowledge in a domain and to participate in further improving and extending that knowledge. Moreover, new high-return research problems have emerged across multiple domains, hence requiring researchers from different domains to jointly work on common topics in bioinformatics, biomedical sciences, intelligent manufacturing, AI-driven medicine, and many more [5, 11, 12, 13].

Simultaneously, effectively managing the huge volume of research outputs has also become much harder, less accurate, and more error-prone. For example, continuously and precisely keeping track of the published results is likely impossible due to the sheer size of the corpus and the difficulty of summarizing it in a way that is usable by humans for analysis, reasoning, and creative problem solving [8]. Even if somebody could somehow read all publications produced on a topic, it would be arguably impossible to cognitively produce a mental representation of the work. It is likely that important aspects and features of the work would be missed or misunderstood. Moreover, such a cognitive effort would be unsustainable over a longer period of time. Another issue is that different researchers have different perspectives on and understandings of the importance of research problems, as well as diverse preferences and priorities on the available topics [14, 15]. Aggregating these different subjective perspectives into an accurate and comprehensive “map” of a research community is difficult [13]. Finally, as research work also has significant societal importance, for example, market advantage, increased security, higher geo-strategic influence, and so on, the positions and roles of researchers and research groups are quite different: some are more altruistic, aiming to address objective needs, for example, scientific problems on understanding fundamental questions in physics, while some pursue subjective needs, which are tightly linked to the goals and objectives of the funding agencies, like research with a heavy commercial character [12]. Exploiting the opportunities while navigating around the difficulties is unsustainable without proper methodologies and software instruments that support the automated execution of the time- and resource-consuming activities in research.

The chapter offers an overview of the envisioned opportunities of using software tools to automate the analysis of the published research outcomes, such as digital books, book chapters, journal articles, conference proceedings, and so on, to maximize the quality and impact of new research activities. A novel system architecture is presented to support two main research activities, problem framing and exploration. The chapter describes the main bibliographic databases used in research work, including their characteristics, like conciseness, relevance, selectivity, and objectivity, as well as the defining features of databases, like IEEE Xplore, PubMed, Web of Science, and others. A discussion of the impact of outdated bibliometric data is given, too. The chapter enumerates research activities that can benefit from automated bibliometric data analysis, that is, publication content categorization, research trend analysis, research gap discovery, researcher and institution assessment, research problem framing, bibliography recommendation, and research team formation. The research knowledge development process is also summarized, as a starting point for creating novel smart recommender systems to aid the activities of knowledge development.

The main contributions of the chapter are as follows:

  • We have discussed an important aspect that must be considered during the creation of new automated tools for using bibliometric databases in research, which is that bibliographic information reflects a past state of research and not the current state. This time lag is important for highly dynamic domains. To mitigate this aspect, prediction techniques should be employed (see Subsection 2.1).

  • We have presented the main bibliometric metadata fields that are available to support the research-related activities (see Table 1).

  • We have described a novel framework for encapsulating insights from bibliometric metadata into research activities (e.g., the multi-recommender system in Figure 1).

Table 1 relates the paper metadata fields to the research activities they can support. The research activities considered (columns) are publication content categorization, research trend analysis, research gap discovery, researcher/institution assessment, research theme framing, and bibliography recommendation. The paper metadata fields considered (rows) are title; abstract; keywords; document ID; document type (e.g., review, article); access (e.g., open, closed, unavailable); author name; author affiliation; author ID; references; citation count; citing patent count; downloads count; and year.

Table 1.

Metadata fields used in research-related activities.

Figure 1.

System architecture for automated research problem framing and exploration.

The chapter has the following structure. Section 2 describes the characteristics of the bibliographic metadata used in summarizing research documents in databases. Section 3 focuses on enumerating the research activities that can benefit from bibliometric information. Section 4 explains the research knowledge development process. A proposed system architecture for automated research problem framing and exploration is offered in Section 5. Conclusions end the chapter.


2. Bibliographic metadata as a research marker

The fast growth in the number of scientific achievements, projects, and researchers, as well as the continuous emergence of new scientific fields, has resulted in an unprecedented expansion in the number of research publications, including journals, conference proceedings, technical reports, patents, dissertations, and books. Comprehensively surveying a very large number of publications becomes increasingly time-consuming, error-prone, and arguably unsustainable without resorting to the already processed, analyzed, and summarized content indexed in bibliographic databases. Bibliographic databases are collections of uniform descriptors of scientific documents (i.e., metadata records) that usually include specific data elements identifying the publication, for example, title, authors, publisher, and year, together with additional supporting information, such as keywords, abstract, citation counts, and references; they therefore provide a rich and valuable source of research-related information. The information has the following characteristics:

  • Conciseness and accuracy: Bibliographic records are compact and robust representations of the publications and the research that supports them.

  • Relevance: The information included in bibliographic records provides the means to not only identify the publications but also summarize their scientific content based on publication title, keywords, and abstract.

  • Selectivity: Grounded in the impossibility of following the entire body of scientific publications and in the need to eliminate irrelevant, duplicate, fake, and low-quality publications, major bibliographic databases have devised means to control their content, for example, by selecting and adopting a whitelist of reputable publications.

  • Objectivity: The bibliographic records are usually not subjected to outside influence and contamination.

The mentioned characteristics are the key support for guiding, shaping, and sustaining scientific research, although some drawbacks are also possible, such as the subscription-based access to highly regarded bibliometric databases (e.g., Scopus and Web of Science), limited access to full text, uneven handling of publications written in languages other than English, and outdated bibliographic data.

2.1 Addressing outdated bibliometric data

Besides the discussed benefits, a possible limitation of bibliometric data, namely its outdatedness, can limit its effective utilization in research-related activities, mainly in the case of fast-paced domains. This limitation has its roots in the time lag δ between the research completion and the moment the related publications are indexed in bibliometric databases, as expressed by:

δ = δMWT + δPL + δPIL,        (1)

where parameters δMWT, δPL, and δPIL represent the manuscript-writing time, the research publication latency, and the publication indexing latency, respectively.

The manuscript-writing time δMWT is the time interval between completing the research and submitting the related paper. It covers activities such as writing the original draft, reviewing, editing, and finding the right publication venue to submit the manuscript. Its value depends on the authors’ expertise and experience and on the type of publication, being lower for short communications and gradually increasing for regular-length papers, book chapters, and books, that is, from a few days or weeks to months or even years.

The publication latency δPL, also known as publication time or editorial handling time, is a specific characteristic of the publication (e.g., journal) to which the research results are submitted for review and publication. It is defined as the average time interval between the moment a manuscript is submitted and the moment the manuscript is published [8]. Since this information is not stored in bibliometric databases, it needs to be either acquired from the publication’s website, if available, or approximated by averaging the time needed to publish the most recent corpus of research papers. In the latter case, the submission date and publication date must be manually extracted from each of the published versions of the papers. In a recent study of biomedical journals, the publication time ranged between 91 and 639 days [16].

The publication indexing latency δPIL represents the time lag between the manuscript publication and the time it appears in a bibliometric database. It depends not only on the bibliometric database indexing services but also on the time it takes the publisher to send the manuscript-related record in the appropriate format to be indexed in the database. This time interval may vary from a few days to a few months.

As a consequence of the delays, bibliometric data do not reflect the present state of scientific research but a past moment. The effect of the delayed representation needs to be mitigated through appropriate techniques to increase the accuracy in estimating and understanding the status of the current scientific research and its related trends. Such techniques can utilize suitable forecasting methods to evaluate the current state from bibliometric data. Hence, we need to derive the number of prediction steps N, which depends on the delay δ and on the time step size Δt. The following formula can be used:

N = ⌊δ / Δt⌉,        (2)

where ⌊·⌉ is the rounding function (i.e., the nearest integer function), and the two quantities on the right side of Eq. (2) are expressed in the same time unit (e.g., years).

To exemplify how the influence of parameter δ can be reduced by using forecasting techniques, we compared the research trends with and without δ-correction against the real trend (ground truth). We considered the time evolution of occurrences for the key term ‘automatic test pattern generation’ in the bibliographic metadata for papers published in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD). The paper metadata were acquired from the IEEE Xplore database. To identify the given key term in the fields ‘Title,’ ‘Keyword,’ and ‘Abstract,’ a TagMe entity-linking procedure with link probability lp = 0.1 was utilized, following the same methodology as the one described in Ref. [8].

Figure 2 plots the time series of the normalized document frequency ndf [17] for the mentioned key term during the years 2013–2023. We started by computing the number of prediction steps N according to Eqs. (1) and (2):

N = ⌊(δMWT + δPL + δPIL) / Δt⌉ = ⌊(0.2 + 0.55 + 0.1) / 1⌉ = 1,        (3)

where the manuscript-writing time δMWT and the publication indexing latency δPIL were considered to be about 2 months (≈0.2 years) and 1 month (≈0.1 years), respectively, the research publication latency about 0.55 years, as computed in Ref. [8], and the time step size Δt one year.

Figure 2.

Trend comparison for the key term ‘Automatic test pattern generation.’

To find the trend for year 2022 without correction (shown as a black line), we considered only the observations from the years 2013–2022 (depicted using light-blue bars). For the trend with correction (the red-dotted line), we added the pink bar having the value obtained using an autoARIMA prediction method with the best parameters (p, d, q) = (1, 0, 0). Finally, the real trend (the blue-dashed line) was computed using the observed value from year 2023 (the dark-blue bar). The figure shows that the trend with δ-correction (slope = 0.0007647) is closer to the real trend (slope = 0.0006856) than the uncorrected trend (slope = 0.0016226).
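To make the δ-correction procedure concrete, the following minimal Python sketch computes N via Eq. (2) and re-fits the trend after appending an ARIMA(1, 0, 0) forecast, mirroring the case study above. It assumes numpy and statsmodels are available; the yearly ndf values are invented placeholders, not the TCAD data.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Assumed delays in years (cf. Eq. (3)): ~2 months writing time,
# ~0.55 years publication latency, ~1 month indexing latency.
delta = 2 / 12 + 0.55 + 1 / 12
dt = 1.0                            # time step size: one year
N = int(round(delta / dt))          # number of prediction steps, Eq. (2)

# Placeholder yearly normalized document frequencies for 2013-2022.
ndf = np.array([0.021, 0.019, 0.020, 0.018, 0.017,
                0.016, 0.015, 0.014, 0.013, 0.012])
years = np.arange(2013, 2023)

def linear_slope(x, y):
    """Least-squares slope of a linear trend fitted to (x, y)."""
    return np.polyfit(x, y, 1)[0]

# Trend without correction: fit only the observed years.
slope_uncorrected = linear_slope(years, ndf)

# Trend with delta-correction: forecast N extra steps with ARIMA(1, 0, 0)
# (the order autoARIMA selected in the chapter's case study) and refit.
forecast = ARIMA(ndf, order=(1, 0, 0)).fit().forecast(steps=N)
ndf_corrected = np.concatenate([ndf, forecast])
years_corrected = np.arange(2013, 2023 + N)
slope_corrected = linear_slope(years_corrected, ndf_corrected)

print(f"N = {N}, uncorrected slope = {slope_uncorrected:.6f}, "
      f"corrected slope = {slope_corrected:.6f}")
```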

A more pressing issue is the impact that the delays expressed in Eq. (1) have on the evolution of a research community. Let us assume that the return of a published work follows a diminishing-return model, in which the returns, for example, the attention received from the community, including citations, increase until reaching a maximum and then decrease over time [5]. The time delay δMWT mainly affects the authors of a work and the close collaborators who have access to the work before it is published. In the case of high-impact results delayed by a long δMWT, the authors and their collaborators might originate a pool of work that builds on the high-impact results, hence increasing their total expected impact. However, other researchers will not be affected. A similar comment is valid for the next delay, δPL, too. Hence, any derivative work performed during the delay δMWT + δPL is bound to be partially outdated, as it was not using the insight of the high-impact result. While this delay has an impact on the usefulness of some derivative work, it can be argued that the impact is not critical. However, the impact of the indexing delay δPIL can be much more disadvantageous for researchers who track the research outputs through bibliometric databases. As many other researchers will have become aware of the critical results after the time δMWT + δPL, they will be able to capture some of the still available return during the interval δPIL. Hence, the remaining return after the bibliometric indexing time will likely be low. This stresses the need for a very short indexing delay δPIL as a necessary requirement for any indexing database.

2.2 Bibliometric databases as main sources of research insight

Bibliometric databases are systematic and extensive digital collections of bibliometric records (i.e., metadata) referring to the published work. They cover journals, conference proceedings, patents, books and book chapters, standards, reports, and many more. Their role in assessing various research-related facets has become critical mainly due to the difficulty of tackling the continuously increasing number of scientific materials published every year and the way academic communities quantify the quality and impact of the scientific work and, thus, evaluate researchers or research teams [6]. This subsection investigates the types of bibliometric metadata fields used to provide research insight. We selected three popular, domain-related databases (i.e., DBLP, IEEE Xplore, and PubMed) and three more general, multi-domain databases (i.e., Clarivate’s Web of Science, Scopus, and Google Scholar). The databases are summarized next.

DBLP was established at the University of Trier in 1993 as an open bibliographic database devoted to the computer science literature. The general structure of a publication metadata record is inspired by the BibTeX syntax and is simple because it includes only essential elements, like title, author names, and publication-related specifications.

IEEE Xplore was launched in 2000 and has since become a reliable source of information for researchers looking for timely information on current trends and advances in the fields of electrical and electronics engineering, computer science, telecommunications, and other related technical areas. It hosts content published under the aegis of the Institute of Electrical and Electronics Engineers (IEEE) and other affiliated associations. The bibliographic metadata record for each indexed publication provides a rich set of data fields, including publication keywords and abstract, different fields for citation counts and patent citation counts, download counts, authors’ affiliations, and so on.

PubMed is an open archive dedicated to biomedical and life sciences journals. It is supported by the U.S. National Institutes of Health (NIH). It has been available since 1996 and uses a complex journal paper metadata structure, which, besides fields like article title, keywords, abstract, authors, affiliations, publication specifications, and citation count, also includes a publication type element (e.g., Journal Article, Letter, Review, Retracted Publication) and an element describing the state of a publication (i.e., ppublish - printed on paper, epublish - digital publication, or ahead of print).

Web of Science was established in 1997 as a multidisciplinary subscription-based bibliographic database that covers a wide spectrum of scientific fields and disciplines in the humanities. Its content is filtered using an evaluation and inclusion procedure that considers a comprehensive set of criteria, including peer review, influence, timeliness, and geographic coverage of publications. Besides providing a complex bibliographic record for publications, Web of Science also offers a set of related instruments, for example, Journal Citation Reports, which presents information about the impact factors of scientific journals.

Scopus is a large and multidisciplinary indexing, abstract, and citation bibliographic database, established in 2004 by Elsevier. It covers a large variety of scientific topics, from social and human sciences to science, technology, and medicine, curated with the help of independent consultants and subject-matter experts. Scopus utilizes a comprehensive publication metadata structure that is made available to the public through subscription.

Google Scholar is a multidisciplinary, free-of-charge database that has stored scholarly publication metadata since 2004. It uses a search methodology based on website crawling and indexing rather than human expert-curated cataloging. Therefore, querying Google Scholar may often provide inconclusive or sometimes misleading results, leaving it up to the researchers to determine if the obtained metadata records are relevant and suitable for their goals. For Google Scholar, the publication metadata structure is reduced, as it does not incorporate some standard indexing fields, like the Digital Object Identifier (DOI), unique IDs for authors and their affiliations, ISSN or ISBN, and so on.

There is a large set of bibliometric metadata fields that are likely to be used to support research-related activities. Their coverage in the bibliometric databases differs, as summarized in Table 2. Due to the specifics of each database (e.g., covered domains and existing metadata fields), the type and depth of insights can vary, sometimes requiring access to several databases to acquire the needed insightful and precise information.

Table 2 summarizes which paper metadata fields are provided by each of the selected databases. The bibliometric databases considered (columns) are DBLP, IEEE Xplore, PubMed, Clarivate’s Web of Science, Scopus, and Google Scholar. The paper metadata fields considered (rows) are title; abstract; keywords; document ID; document type (e.g., review, article); access (e.g., open, closed, unavailable); author name; author affiliation; author ID; references; citation count; citing patent count; downloads count; and year.

Table 2.

Journal paper metadata fields to support research-related activities.

The specific usage of bibliographic metadata fields in research activities can be classified into the following categories:

  • Publication content categorization generally relies on summarizing the content of a publication from its fields ‘Title,’ ‘Keywords,’ and ‘Abstract’ and from sets of key terms. Text representations, like the classic bag-of-words [18] and bag-of-entities [9], are used. As a result, the scientific output is categorized, depending on the purpose of its use, into either fine-grained themes (e.g., Multi-Factor Authentication or Generative Adversarial Networks) or coarse-grained research areas (e.g., Information Security or Machine Learning) [10]. Specific Natural Language Processing (NLP) techniques are used to support bibliographic data acquisition and preprocessing, key term extraction, key term frequency analysis, topic modeling, and text classification [19, 20, 21].

  • Research trend analysis is a systematic procedure meant to reveal and understand scientific trends and detect hot topics in research. Two different approaches have been pursued: one considers the time evolution of the number of publications associated with a key term or set of key terms characterizing the research domain [11], and one utilizes citation analysis of the domain-relevant publications and any resulting bibliographic coupling [13, 22]. Both types of methods rely on time series modeling and prediction techniques, such as the Mann–Kendall test or Sen’s slope estimator, or on burst detection algorithms [11]; a minimal sketch of the former is given after this list.

  • Research gap discovery requires the analysis of the bibliographic metadata to unveil non-existent or weak links inside sets of key terms during a specified time interval [5, 12]. If there is sufficient scientific knowledge to support each individual key term and the links between them remain insignificant, a possible research gap may have been identified [23]. The endeavor to discover research gaps is generally based on evaluating the time distribution of key terms and their co-occurrences in a given publication metadata corpus.

  • Researcher and institution assessment is the continuous process to characterize the research activities of both individuals and organizations. This process is increasingly data-driven and reliant on metrics [24]. The evaluation generally relies on investigating scientific production in terms of quantity, quality, and impact based on carefully selected bibliometric-related criteria and indicators (e.g., h-index and Field-Weighted Citation Impact (FWCI)) [7].

  • Research theme framing is a complex process that is tightly linked to the scientific gap discovery and the research trend evaluation but must also consider the experience and expertise of the researchers possibly involved in these themes. Because not every research gap is a feasible starting point for new research initiatives, it is critical to choose research gaps that provide sufficient scientific innovation prospects at that particular time moment. Various NLP and machine learning (ML) techniques and available bibliographic resources can be utilized [23].

  • Bibliography recommendation involves searching for the most appropriate references to start surveying the state-of-the-art in a given scientific domain. According to Ref. [25], it can be formalized as a ranking problem with two steps: (i) publication selection, where a set of candidate references is identified, and (ii) publication ranking, where candidate documents are sorted based on carefully predefined criteria. The process generates a reference list characterized by the appropriate domain coverage, representativeness, and timeliness. It contains a list of review papers to offer a broad perspective on the given domain and a set of research articles to give deeper insight into the research topic. To obtain a higher accuracy, this recommendation model has to rely not only on paper metadata but also on indirect features, like citation analysis over time and scientific quality assessment [26, 27].

  • Research team formation is an important task when dealing with new scientific themes for which there is still no established community. If the research project is complex, multidisciplinary, or the number of candidates is large, then the bibliographic information can be a solid starting point [28]. The candidates are selected based on a quantitative and qualitative analysis that considers both the technical expertise and teamwork traits of the candidates. This information can be extracted from bibliometric data [29].

Table 1 describes the metadata fields that, we think, are suitable for research activities.
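As an illustration of the trend-analysis activity, the sketch below applies the Mann–Kendall test (without tie correction) and Sen's slope estimator to a hypothetical series of yearly key-term occurrence counts. It assumes numpy and scipy are available and is only a simplified stand-in for the procedures cited above, not a reproduction of them.

```python
import numpy as np
from scipy.stats import norm

def mann_kendall(counts):
    """Mann-Kendall trend test (no tie correction) on a yearly series."""
    x = np.asarray(counts, dtype=float)
    n = len(x)
    # S statistic: sum of signs over all ordered pairs (i < j).
    s = sum(np.sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    z = 0.0 if s == 0 else (s - np.sign(s)) / np.sqrt(var_s)
    p = 2 * (1 - norm.cdf(abs(z)))      # two-sided p-value
    return s, z, p

def sens_slope(counts):
    """Sen's slope estimator: median of all pairwise slopes."""
    x = np.asarray(counts, dtype=float)
    n = len(x)
    slopes = [(x[j] - x[i]) / (j - i) for i in range(n - 1) for j in range(i + 1, n)]
    return np.median(slopes)

# Hypothetical yearly occurrence counts of one key term in paper metadata.
yearly_counts = [4, 6, 5, 9, 11, 10, 14, 17, 16, 21]
s, z, p = mann_kendall(yearly_counts)
print(f"S = {s}, Z = {z:.2f}, p = {p:.4f}, "
      f"Sen slope = {sens_slope(yearly_counts):.2f}")
```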


3. Importance of using bibliometric information to improve research activities

There has been an unprecedented increase over the last decades in research activity and research outputs. This includes traditional areas, like physics, computer science, electrical engineering, medicine, and so on, as well as new, emerging areas, such as machine learning, neuroscience, biochemistry, bioinformatics, and others. The sheer volume of new information and knowledge produced every single day originates new challenges and opportunities. For example, the large number of published research outcomes can complicate the process of tracking the studied ideas, the objective and comprehensive comparison of the ideas to understand their merits and limitations, and the discovery of any research gaps in a field of study. These issues become more acute in cross-disciplinary research. Therefore, effective framing of new research questions in a timely and systematic way is increasingly difficult and arguably still pursued in an empirical way guided by personal experience and preferences. Likewise, identifying research teams with an expected high success rate for a new research problem is difficult without more advanced automated support to quantify the existing literature, the current research needs, and the available pool of expertise. Hence, it is possible that the existing research resources, for example, the research funding, are utilized in a less than optimal way.

Transforming the way in which research communities operate to address research needs by using existing resources requires a new way of processing and utilizing the outputs of a research community, for example, its publications, driven by automated or semi-automated algorithms that assess the evolution of the research needs and ideas, track problems that are over-studied and topics that are ignored, describe the expertise and team-working skills of researchers, and possibly suggest collaborators and team structures. Existing work, including our own research [23, 30], has studied these needs, often based on NLP of electronic documents to perform activities, like topic modeling, document clustering, keyword extraction, trend modeling, identifying high-impact papers and researchers, characterizing the implicit and explicit structure of research domains, and so on. Existing work has achieved some notable results, like defining new metrics to describe the impact of papers and authors [30, 31], topic modeling [32], and document summarizing [33]. Still, many open problems remain, especially related to using semantic information in characterizing and optimizing research work.


4. Research knowledge development process

Research is often characterized as the process of investigating the available knowledge sources and materials to establish facts and suggest new research needs [4]. Novel research results often reuse, link, and combine fragments of the existing body of knowledge [5, 34] and, less often, restructure the existing knowledge to formulate new theories and models in purely creative processes (e.g., through sudden insight) [1].

Figure 3 summarizes the knowledge flow of this process.

Figure 3.

Knowledge development process for a specific research field.

The system in Figure 3 for automating and optimizing the new knowledge development process first must collect all the results created by a research field, including electronic publications in peer-reviewed databases, repositories, blogs, company publications, and other documents. The repository arguably contains the entire body of knowledge acquired over time and includes theories, methods, datasets, experiments, case studies, and the scientific literature. This repository should be in a format that is easy to query, possibly including variants based on large language models (LLMs). We argue that there are several ways in which this repository can be developed:

  1. In-domain research

  2. Knowledge transferred from other domains

  3. Inter−/trans- and multi-domain research

  4. Pure-new knowledge coming from sudden insights

In-domain research refers to incremental research, as the needed knowledge is already available within the domain under investigation. This is why such research work mainly focuses on reusing, linking, and combining in a new way the existing ideas in order to obtain novel research outcomes.

A second possibility to extend the body of knowledge in the domain under investigation is to transfer and adapt knowledge from other fields. This is popular in applied domains, where traditionally research outcomes in the theoretical domains were mapped to solve applications. Arguably, an effective way to originate new research ideas can be achieved, if possible, by transferring knowledge from related fields, that is, conceptually similar topics, methodologies, and algorithms. Also, important contributions may come from fields that are on an upward trend or which are expected to be peak domains in the near future, that is, emerging domains.

Also, collaborations between researchers from distinct fields address cross-disciplinary research needs, which include multi-, inter-, or trans-disciplinary research, where parts of the results may enhance the knowledge repository of the domain under investigation [35].

The last way to grow the repository is through emergent innovative knowledge, that is, sudden insight [1]. Such insight can be disruptive and spawn new opportunities and ideas to be studied.


5. System architecture for automated research problem framing and exploration

This section discusses our methodology as well as other existing approaches to automate the process of framing new and feasible research problems, recommend relevant scientific literature, and suggest possible teams to study the problem. Existing approaches differ by the degree of human participation, that is, the level of human supervision, as the human expert is expected to guide the process according to her/his goals and expectations.

The input to the system is a dataset containing scientific paper metadata (i.e., titles, keywords, and abstracts) from significant publications, which may effectively illustrate the research work in the field of study and in related or emerging domains. To acquire the needed dataset, various metadata and records are typically extracted from influential bibliometric databases, like Clarivate Web of Science, Scopus, PubMed, and IEEE Xplore. Our system, since it targets research in information technology and electrical and electronic engineering, uses IEEE Xplore as its bibliometric data source. The previous sections summarized the main databases and explained their pros and cons in terms of the offered information. Our ideas on effective organization of such datasets to aid automated research problem framing and exploration are also presented here.

Our system incorporates four modules that can act either together, as shown in Figure 1, or as standalone modules:

  1. Research theme recommender module

  2. Cross-domain knowledge transfer recommender module

  3. Scientific literature recommender module

  4. Research team recommender module

5.1 Research theme recommender module

This module provides a list of relevant and feasible research themes by analyzing the state of research within the domain and by evaluating the opportunity and feasibility of the themes through trend and statistical analyses. The corpus of relevant paper metadata is employed to identify a comprehensive list of domain-specific key terms, discover the research gaps within the domain, and evaluate the opportunity and feasibility of investigating a topic corresponding to such research gaps.

Framing new research themes has historically been grounded in three fundamental scientometric methodologies and their combinations [23, 36]: (i) investigating the trends in scientific output to reveal the evolutionary dynamics that govern the scientific production in a specific field [37], a typical instance being the employment of the Mann–Kendall test, Sen’s slope assessment, and Kleinberg’s burst method to investigate key terms’ popularity and trends [11]; (ii) analyzing citation networks by tracking the scientific community’s interest in a given topic based on the number of citations or co-citations to map the intellectual structure and evolution of a scientific field [2, 38]; and (iii) conducting content analysis either on the paper metadata fields, like ‘Title,’ ‘Keywords,’ and ‘Abstract,’ or on full-text documents by employing topic modeling coupled with co-word analysis to extract scientific topics and explore their time evolution [39, 40]. While the aforementioned methods may provide valuable insights into the dynamics of a research domain to aid scientific theme framing, they do not directly tackle the discovery of research gaps.

In order to formulate new and viable research themes from bibliographic metadata, we proposed a practical methodology that formulates the discovery of feasible research gaps as a graph-theoretical problem centered on key term co-occurrences [23]. Starting from the assumption that a scientific domain can be mapped to an integer-weighted undirected graph, where the nodes represent the domain’s key terms and the co-occurrence matrix extracted from paper metadata is the graph’s cost adjacency matrix, we model the identification of viable scientific gaps as a particular sub-graph extraction problem, where the weights associated with the edges need to be confined to a specified interval to assure a desired level of innovation- and success-related prospects. In our view, to automatically identify feasible research gaps to work on, the following sequence of steps must be followed:

Step 1: Acquire a paper metadata corpus corresponding to the journals and conference proceedings that characterize the considered scientific domain.

Step 2: Using entity linking techniques (e.g., TagMe) applied to the metadata fields ‘Title,’ ‘Keywords,’ and ‘Abstract,’ identify a finite set of key terms that characterize the scientific domain, and compute their co-occurrences to build the domain’s weighted undirected graph;

Step 3: Employing a double-threshold procedure that presumes deriving a success threshold α and a novelty threshold β from statistical data, drop the edges of the domain’s weighted undirected graph whose weights do not lie in the interval delimited by α and β.

Step 4: Extract all connected subgraphs (i.e., models of research gaps) having a predefined number of nodes, rank them according to the mean co-occurrence value μ (i.e., the average of all subgraph edge weights), and present these recommendations to the user.

This research theme recommender is presented in detail in Ref. [23] along with an illustrative case study from the Electronic Design Automation (EDA) domain.
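A minimal Python sketch of Steps 3 and 4 is given below. It assumes networkx is available, uses invented key terms and co-occurrence counts, and maps the two thresholds onto a generic lower and upper bound on edge weights; the exact roles of α and β and the final ranking order follow Ref. [23] and are not reproduced here.

```python
from itertools import combinations
import networkx as nx

def research_gap_candidates(cooc, w_min, w_max, k=3):
    """Sketch of the double-threshold gap discovery (Steps 3 and 4).

    cooc  : dict mapping frozenset({term_a, term_b}) -> co-occurrence count
    w_min : lower bound on edge weight (kept edges must reach this value)
    w_max : upper bound on edge weight (kept edges must not exceed it)
    k     : number of key terms per candidate research gap
    """
    # Step 3: keep only edges whose weight lies inside [w_min, w_max].
    g = nx.Graph()
    for pair, w in cooc.items():
        a, b = tuple(pair)
        if w_min <= w <= w_max:
            g.add_edge(a, b, weight=w)

    # Step 4: enumerate connected k-node subgraphs and rank by mean weight.
    ranked = []
    for nodes in combinations(g.nodes, k):
        sub = g.subgraph(nodes)
        if sub.number_of_edges() and nx.is_connected(sub):
            weights = [w for _, _, w in sub.edges(data="weight")]
            ranked.append((sum(weights) / len(weights), nodes))
    return sorted(ranked)  # ascending mean co-occurrence; adjust as needed

# Invented co-occurrence counts between hypothetical domain key terms.
cooc = {
    frozenset({"ATPG", "fault model"}): 12,
    frozenset({"ATPG", "machine learning"}): 3,
    frozenset({"fault model", "machine learning"}): 2,
    frozenset({"machine learning", "layout"}): 40,
}
print(research_gap_candidates(cooc, w_min=1, w_max=10, k=3))
```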

5.2 The cross-domain knowledge transfer recommender module

This module identifies potential sources of relevant knowledge transfers that help solve the recommended research problem. Document similarity assessment and topic modeling techniques can point toward twin and emerging domains to find methods or materials that relate to the current research theme.

The interest in developing automatic techniques to identify suitable inter-domain knowledge transfers is arguably still in a pioneering stage, even though such methods may provide feasible and straightforward solutions to a wide spectrum of research problems. For example, employing a citation network extracted from scientific publications belonging to the aviation and sustainability fields, Nakamura et al. proposed a recognized-unrecognized matricial model to highlight the neglected problems and identify suitable knowledge for developing an innovative water/air transport system [41]. A similar approach was adopted by Ogawa and Kajikawa [42]. They fused knowledge identified for the fields of fuel cell technology and ammonia synthesis to propose novel research topics based on time-series analysis of citation networks and keyword similarity. An interesting experiment was reported by Ittipanuvat et al., who investigated hidden knowledge connections between two totally different scientific areas, namely robotics and gerontology, to develop new robotic applications for assisting disabled and elderly people [3]. As may be observed, these studies focus on specific situations rather than trying to offer methods for generic cross-domain knowledge transfers.

In the context of the design of electronic analog and mixed-signal (AMS) circuits, our previous work has devised a new model and the necessary circuit schematic mining algorithms for representing AMS design metaknowledge [43]. The representation has three parts: the associative component, which expresses the hierarchy of the concepts used in the circuit designs; the performance component, which presents the performance of the AMS circuits, like performance tradeoffs and bottlenecks; and the causal modeling component, which presents the likely starting ideas and steps used to design a circuit. The same approach used to devise the metaknowledge representation for AMS circuits was then utilized to design a metaknowledge representation for other open-ended problems, including architectural engineering and general-domain problems [44]. Finally, a cognitive architecture, called InnovA, was suggested to use the metaknowledge representation to automatically generate new AMS circuit designs, either by incrementally modifying circuits in the representation or by combining features of distinct circuits from the knowledge representation [45].

In our view, a cross-domain knowledge transfer recommender should be based on the following key aspects: (a) scientific domains and research themes may effectively be modeled by sets of key terms; (b) bibliographic metadata are authoritative and comprehensive sources of information about scientific output; and (c) NLP techniques can effectively extract and interpret information from textual data to identify hidden links or patterns between the bodies of knowledge of two different domains. Our suggested approach is centered around the concept of ‘twin scientific domains’ (i.e., scientific areas sharing similar or related research topics, notions, theories, materials, and methods), since the similarity of such domains is likely to provide reasonable knowledge transfers. Thus, to discover feasible knowledge transfers, we have to follow three phases (a minimal sketch of the twin-domain identification phase is given after the list):

  1. Bibliographic metadata acquisition and processing to build a domain-characteristic text file for each of the scientific domains (i.e., both target and potential twin domains) under investigation. This stage employs an entity linking method to construct a document containing the list of all key terms encountered in the paper metadata fields ‘Title,’ ‘Keywords,’ and ‘Abstract’ for each of the mentioned scientific domains.

  2. Twin domain identification, where a document similarity method is applied to rank the potential twin domains based on the similarity of their domain-characteristic text files with the one corresponding to the target domain.

  3. Investigating the body of knowledge in each of the best-ranked twin domains by evaluating the key term co-occurrences and suggesting the knowledge transfers, which are characterized by considerably higher co-occurrences in the twin domain as compared to the one in the target domain.
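A minimal sketch of phase 2, the twin domain identification, is given below. It assumes scikit-learn is available and represents each domain by a TF-IDF vector of its key-term document; all domain names and key-term strings are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical domain-characteristic "key-term documents": one string per
# domain, obtained by concatenating the key terms extracted from the metadata
# fields 'Title', 'Keywords', and 'Abstract' during phase 1.
target_domain = "analog circuit sizing optimization surrogate model"
candidate_domains = {
    "structural engineering": "truss sizing optimization surrogate model load",
    "gerontology": "elderly care assistive robotics fall detection",
}

vectorizer = TfidfVectorizer()
docs = [target_domain] + list(candidate_domains.values())
tfidf = vectorizer.fit_transform(docs)

# Phase 2: rank candidate twin domains by cosine similarity to the target.
similarities = cosine_similarity(tfidf[0:1], tfidf[1:]).ravel()
ranking = sorted(zip(candidate_domains, similarities),
                 key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In practice, the per-domain documents would contain thousands of key terms, and the similarity step would be followed by the co-occurrence comparison of phase 3 to pinpoint the specific concepts worth transferring.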

5.3 Scientific literature recommender module

To help researchers establish a suitable starting point from where to conduct the literature review, the proposed system directs the search in two directions: an in-domain exploration to find high-impact work on a research theme and twin/emerging domain explorations to find relevant papers on the knowledge we intend to transfer.

Previous work has focused on devising metrics to describe the research impact of individuals, like the well-known h-index. However, there has been significantly less effort to characterize the role of a research group in a community. Our work has addressed this gap by conducting an extensive experimental study on the role a group plays in the flow of ideas through a research community [30]. The study considered the contribution of groups to the impact research ideas make on other groups and the way a group contributes to linking a community together. Experiments have also indicated that groups can be clustered into four categories based on the count of the references to their work. Groups in the two mid-tier categories play a main role in bridging between groups, thus linking them into a well-connected community. This insight is important, as more fragmented communities, like those with fewer mid-tier groups, are likely to be less effective.

This recommender system is meant to aid researchers in reducing the number of publications that are subjected to (manual) literature review and is based on exploring the bibliographic databases with carefully tailored search strategies that are implemented as search queries. To provide a reliable starting point for the research theme, a scoping review procedure [46] that offers a structured and comprehensive image of the research theme with respect to all its facets has to be carried out [14, 47]. Hence, the user has to pursue the following steps [48]: (i) choose the appropriate bibliographic database to search by investigating the databases’ domain specificity and adequateness; (ii) pick key terms to search for by considering both the research theme needs and the bag of key terms likely to be extracted from the metadata fields ‘Title,’ ‘Keywords,’ and ‘Abstract’ (e.g., by using TagMe, WikiMiner, or RedW entity linking methods, for which the key terms should represent Wikipedia mentions); (iii) decide the boundaries of the systematic review by targeting only a defined set of publications (i.e., employing the ‘publication title’ metadata field) and/or publications from a given time interval (i.e., using the metadata field ‘Year’); and (iv) select the type of papers to be explored, namely surveys and/or research papers, this information being made available by the metadata field ‘Article type.’ The search query configuration is neither simple nor easy, as it relies on the user’s experience and expertise.
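As a rough illustration of steps (ii)–(iv), the sketch below filters a small list of metadata records by key terms, venue, year range, and article type. The record structure and all values are invented; a real implementation would operate on exports or API responses from the chosen database.

```python
# Hypothetical metadata records as exported from a bibliographic database;
# the field names mirror Table 1, but every value is invented.
records = [
    {"title": "A survey of automatic test pattern generation",
     "abstract": "...", "keywords": ["ATPG", "fault model"],
     "publication_title": "IEEE TCAD", "year": 2022, "article_type": "Review"},
    {"title": "Learning-based ATPG acceleration",
     "abstract": "...", "keywords": ["ATPG", "machine learning"],
     "publication_title": "IEEE TCAD", "year": 2018, "article_type": "Article"},
]

def scoping_filter(records, key_terms, venues, year_range, article_types):
    """Apply steps (ii)-(iv): key-term match, venue, year, and article type."""
    lo, hi = year_range
    selected = []
    for rec in records:
        text = " ".join([rec["title"], rec["abstract"],
                         " ".join(rec["keywords"])]).lower()
        if (any(term.lower() in text for term in key_terms)
                and rec["publication_title"] in venues
                and lo <= rec["year"] <= hi
                and rec["article_type"] in article_types):
            selected.append(rec)
    return selected

hits = scoping_filter(records,
                      key_terms=["automatic test pattern generation", "ATPG"],
                      venues={"IEEE TCAD"}, year_range=(2020, 2023),
                      article_types={"Review"})
print([r["title"] for r in hits])
```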

Recently, some interesting literature-recommending environments suited for the biomedical field have become available. LitSuggest employs a large set of classifiers (e.g., ridge classifier and elastic net classifier) to decide, based on a user-provided list of papers, whether other publications existing in the PubMed database are relevant, irrelevant, or ambiguous. For this, the metadata fields ‘Title,’ ‘Abstract,’ ‘Keywords,’ ‘Journal,’ and ‘Publication type’ are utilized in a bag-of-words representation [49]. Another Web-based software tool to suggest scientific literature is RobotAnalyst, which combines ML algorithms (i.e., classifiers and topic modeling) with text mining of the metadata field ‘Abstract’ to classify and rank references by their semantic similarity [15]. While effective in biomedical domains, such approaches have some inherent limitations (e.g., the inability to cope with different research theme granularities, the reduced adaptability to other scientific domains, ignoring the influence of high-visibility individuals and groups in the field, and reduced explainability and transparency due to the type of ML models used). Hence, a more in-depth analysis of the literature recommending process is needed.

5.4 Research team recommender module

A set of possible teams to carry out a specific research theme can be automatically proposed by solving a complex, multi-objective team formation optimization problem, in which the corpus of paper metadata is analyzed to extract information about the expertise and teamwork skills associated with the researchers in a database.

Among the first to employ bibliographic metadata in research team formation were Lappas et al. [28]. Based on four paper metadata fields acquired from the DBLP database (i.e., fields ‘Title,’ ‘Authors,’ ‘Publication year,’ and ‘Publication name’), they took into account each candidate’s technical proficiency to complete the task requirements as well as their teamwork skills to build effective research teams. Since then, despite its limited topic coverage and simplified publication metadata structure, the DBLP bibliographic database has been extensively utilized as a source of information about computer science researchers to assess diverse team-shaping strategies [29, 50]. Although several studies suggested obtaining data from more extensive databases, including PubMed [51] or Google Scholar [52], they remain grounded in the same limited selection of metadata fields, which may offer an imprecise and insufficient depiction of potential team members. Therefore, the full potential of using bibliographic information for effective team formation has not yet been realized, requiring ongoing efforts to build new or improve upon existing techniques. In our opinion, the focus should be directed toward developing new theoretical and computational models of collaborative research work and finding a suitable set of candidate-related metrics that can be computed from bibliographic information. For this, the number of considered bibliographic metadata fields has to be increased by employing the citation and download counts, authors’ affiliations, or publication abstracts when extracting the candidates’ personal and interpersonal competencies. Moreover, appropriate multi-objective team formation optimization procedures relying on candidate-related indicators derived from bibliographic metadata need to be devised and adapted to particular research theme contexts.
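As a toy illustration of such a formulation, the sketch below scores candidate teams by a weighted sum of expertise coverage and prior co-authorship density, both derived from invented bibliometric profiles; the objective, weights, and candidate data are assumptions, not the methods of Refs. [28, 29].

```python
from itertools import combinations

# Hypothetical candidate profiles derived from bibliographic metadata:
# expertise = key terms mined from each author's papers,
# coauthors = prior co-authorship links (a crude teamwork proxy).
candidates = {
    "A": {"expertise": {"ATPG", "fault model"},        "coauthors": {"B"}},
    "B": {"expertise": {"machine learning"},           "coauthors": {"A", "C"}},
    "C": {"expertise": {"machine learning", "layout"}, "coauthors": {"B"}},
}
required = {"ATPG", "machine learning", "layout"}

def team_score(team, lam=0.3):
    """Weighted sum of expertise coverage and collaboration density."""
    covered = set().union(*(candidates[m]["expertise"] for m in team))
    coverage = len(covered & required) / len(required)
    pairs = list(combinations(team, 2))
    links = sum(1 for a, b in pairs if b in candidates[a]["coauthors"])
    density = links / len(pairs) if pairs else 0.0
    return coverage + lam * density

# Exhaustively score all two-person teams and pick the best one.
best = max(combinations(candidates, 2), key=team_score)
print(best, round(team_score(best), 3))
```

For more than a handful of candidates, exhaustive enumeration no longer scales, and the same kind of score would instead drive a heuristic or multi-objective optimizer.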

5.5 Using the proposed architecture in research

This section discusses a possible procedure of using the system architecture in Figure 1 for automated research problem framing and exploration. Other alternatives of using the architecture can be devised, too.

The first step in the procedure is the selection of a potential research theme by the likely principal investigators (PIs) of a future research project. The Research Theme Recommender module is used for this task. Using keywords describing current, main research themes, the PIs identify a set of themes, like three to five research topics, that cover the keywords but are less studied by the existing scientific literature. The method in Ref. [23] or other similar methods can be used for this step. For the selected themes, opportunities for transferring knowledge across the related domains are then identified using the Cross-Domain Knowledge Transfer Recommender module. The feasibility of these opportunities is analyzed next; for example, an algorithm already used to effectively describe integrated circuit layouts might be identified as an interesting approach to represent other visual data, too. The next two steps select the scientific literature for the project and make recommendations on possible research teams. The Scientific Literature Recommender module offers a starting point on the research papers that support an identified cross-disciplinary theme. New related papers are identified as the research work is carried out. Team selection utilizes the Research Team Recommender module. A dataset available to form cross-disciplinary teams of members from “Politehnica” University Timisoara is available at Ref. [53]. The procedure can be repeated to adjust the keywords used for research theme recommendation or to select other research themes among those recommended and explore other cross-domain knowledge transfer opportunities.


6. Conclusions

This chapter discussed the current opportunities and challenges in using automated methodologies and software tools to analyze published research outcomes present in bibliometric databases, like digital books, book chapters, journal articles, conference proceedings, and so on. The goal is to maximize the quality and impact of new research. The chapter enumerated the main features of the most frequently used bibliographic databases, like DBLP, IEEE Xplore, PubMed, Web of Science, Scopus, and Google Scholar. The main characteristics of the databases were also introduced, for example, conciseness, relevance, selectivity, and objectivity. The impact of outdated bibliometric data was mentioned, and a correction method was suggested. The chapter proposed a novel system architecture to automatically support the research knowledge development process. The system is a collection of recommender modules that target the different activities of the knowledge development process, for example, publication content categorization, research trend analysis, research gap discovery, researcher and institution assessment, research problem framing, bibliography recommendation, and research team formation. Future work is expected to address each of the enumerated activities, so that the potential of automated processing of bibliometric data is fully harnessed.

References

  1. 1. Kuhn TS. The Structure of Scientific Revolutions. Chicago, IL, USA: University of Chicago; 1996
  2. 2. Upham S, Small H. Emerging research fronts in science and technology: Patterns of new knowledge development. Scientometrics. 2010;83(1):15-38
  3. 3. Ittipanuvat V, Fujita K, Sakata I, Kajikawa Y. Finding linkage between technology and social issue: A literature based discovery approach. Journal of Engineering and Technology Management. 2014;32:160-184
  4. 4. Leedy PD, Ormrod JE. Practical Research. Saddle River, NJ, USA: Pearson Education; 2005
  5. 5. Liu X, Doboli A, Doboli S. Bottom–up modeling of design knowledge evolution: Application to circuit design community characterization. IEEE Transactions on Computational Social Systems. 2020;8(3):689-703
  6. 6. de Oliveira OJ, da Silva FF, Juliani F, Barbosa LC, Nunhes TS. Bibliometric method for mapping the state-of-the-art and identifying research gaps and trends in literature: An essential instrument to support the development of scientific projects. In: Scientometrics Recent Advances. London, UK: IntechOpen; 2019
  7. 7. Campbell D, Picard-Aitken M, Côté G, Caruso J, Valentim R, Edmonds S, et al. Bibliometrics as a performance measurement tool for research evaluation: The case of research funded by the National Cancer Institute of Canada. American Journal of Evaluation. 2010;31(1):66-83
  8. 8. Curiac CD, Banias O, Micea M. Evaluating research trends from journal paper metadata, considering the research publication latency. Mathematics. 2022;10(2):233
  9. 9. Gonzalez, Pinto JM, Balke WT. Demystifying the semantics of relevant objects in scholarly collections: A probabilistic approach. In: Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries. Knoxville TN, USA: IEEE; 2015. pp. 157-164
  10. 10. Zhang Y, Jin B, Chen X, Shen Y, Zhang Y, Meng Y, et al. Weakly supervised multi-label classification of full-text scientific papers. In: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. Long Beach, CA, USA: ACM; 2023. pp. 3458-3469
  11. 11. Marrone M. Application of entity linking to identify research fronts and trends. Scientometrics. 2020;122(1):357-379
  12. 12. Juliani F, de Oliveira OJ. State of research on public service management: Identifying scientific gaps from a bibliometric study. International Journal of Information Management. 2016;36(6):1033-1041
  13. 13. Boyack KW, Klavans R. Creation of a highly detailed, dynamic, global model and map of science. Journal of the Association for Information Science and Technology. 2014;65(4):670-685
  14. 14. Azarian M, Yu H, Shiferaw AT, Stevik TK. Do we perform systematic literature review right? A scientific mapping and methodological assessment. Logistics. 2023;7(4):89
  15. 15. Przybyła P, Brockmeier AJ, Kontonatsios G, Le Pogam M-A, McNaught J, von Elm E, et al. Prioritising references for systematic reviews with robot analyst: A user study. Research Synthesis Methods. 2018;9(3):470-488
  16. 16. Andersen MZ, Fonnes S, Rosenberg J. Time from submission to publication varied widely for biomedical journals: A systematic review. Current Medical Research and Opinion. 2021;37(6):985-993
  17. 17. Happel HJ, Stojanovic L. Analyzing organizational information gaps. In: Proceedings of the 8th International Conference on Knowledge Management. Graz, Austria. 2008. pp. 28-36
  18. 18. Joorabchi A, Mahdi AE. An unsupervised approach to automatic classification of scientific literature utilizing bibliographic metadata. Journal of Information Science. 2011;37(5):499-514
  19. Akundi A, Euresti D, Luna S, Ankobiah W, Lopes A, Edinbarough I. State of Industry 5.0: Analysis and identification of current research trends. Applied System Innovation. 2022;5(1):27
  20. Curiac C-D, Micea MV. Identifying hot information security topics using LDA and multivariate Mann-Kendall test. IEEE Access. 2023;11:18374-18384
  21. Chen H, Wang X, Pan S, Xiong F. Identify topic relations in scientific literature using topic modeling. IEEE Transactions on Engineering Management. 2019;68(5):1232-1244
  22. Hopcroft J, Khan O, Kulis B, Selman B. Tracking evolving communities in large linked networks. Proceedings of the National Academy of Sciences. 2004;101(Suppl 1):5249-5253
  23. Curiac C-D, Doboli A, Curiac DI. Co-occurrence-based double thresholding method for research topic identification. Mathematics. 2022;10(17):3115
  24. Hicks D, Wouters P, Waltman L, De Rijcke S, Rafols I. Bibliometrics: The Leiden manifesto for research metrics. Nature. 2015;520(7548):429-431
  25. Bhagavatula C, Feldman S, Power R, Ammar W. Content-based citation recommendation. arXiv:1802.08301. 2018. DOI: 10.48550/arXiv.1802.08301
  26. Nair AM, Benny O, George J. Content based scientific article recommendation system using deep learning technique. In: Inventive Systems and Control: Proceedings of ICISC. Singapore: Springer; 2021. pp. 965-977
  27. Chaudhuri A, Sinhababu N, Sarma M, Samanta D. Hidden features identification for designing an efficient research article recommendation system. International Journal on Digital Libraries. 2021;22(2):233-249
  28. Lappas T, Liu K, Terzi E. Finding a team of experts in social networks. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Paris, France: ACM; 2009. pp. 467-476
  29. Hamidi Rad R, Fani H, Bagheri E, Kargar M, Srivastava D, Szlichta J. A variational neural architecture for skill-based team formation. ACM Transactions on Information Systems. 2023;42(1):1-28
  30. Liu X, Doboli A, MacCarthy T, Doboli S. Understanding the significance of mid-tier research teams in idea flow through a community. IEEE Transactions on Computational Social Systems. 2023;10(6):3422-3432
  31. Kim D, Lee B, Lee HJ, Lee SP, Moon Y, Jeong MK. Automated detection of influential patents using singular values. IEEE Transactions on Automation Science and Engineering. 2012;9(4):723-733
  32. Blei DM. Topic modeling and digital humanities. Journal of Digital Humanities. 2012;2(1):8-11
  33. Allahyari M, Pouriyeh S, Assefi M, Safaei S, Trippe ED, Gutierrez JB, et al. Text summarization techniques: A brief survey. arXiv:1707.02268. 2017. DOI: 10.48550/arXiv.1707.02268
  34. Doboli A, Umbarkar A, Subramanian V, Doboli S. Two experimental studies on creative concept combinations in modular design of electronic embedded systems. Design Studies. 2014;35(1):80-109
  35. Vugteveen P, Lenders R, Van den Besselaar P. The dynamics of interdisciplinary research fields: The case of river research. Scientometrics. 2014;100:73-96
  36. Mazov NA, Gureev VN, Glinskikh VN. The methodological basis of defining research trends and fronts. Scientific and Technical Information Processing. 2020;47:221-231
  37. Liu X, Jiang T, Ma F. Collective dynamics in knowledge networks: Emerging trends analysis. Journal of Informetrics. 2013;7(2):425-438
  38. Shiau WL, Wang X, Zheng F. What are the trend and core knowledge of information security? A citation and co-citation analysis. Information & Management. 2023;60(3):103774
  39. Sivanandham S, Sathish Kumar A, Pradeep R, Sridhar R. Analysing research trends using topic modelling and trend prediction. In: Soft Computing and Signal Processing: Proceedings of 3rd ICSCSP 2020. Vol. 1. Singapore: Springer; 2020. pp. 157-166
  40. Mohammadi E, Karami A. Exploring research trends in big data across disciplines: A text mining analysis. Journal of Information Science. 2022;48(1):44-56
  41. Nakamura H, Ii S, Chida H, Friedl K, Suzuki S, Mori J, et al. Shedding light on a neglected area: A new approach to knowledge creation. Sustainability Science. 2014;9:193-204
  42. Ogawa T, Kajikawa Y. Generating novel research ideas using computational intelligence: A case study involving fuel cells and ammonia synthesis. Technological Forecasting and Social Change. 2017;120:41-47
  43. Jiao F, Montano S, Ferent C, Doboli A, Doboli S. Analog circuit design knowledge mining: Discovering topological similarities and uncovering design reasoning strategies. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 2015;34(7):1045-1058
  44. Doboli A, Umbarkar A, Doboli S, Betz J. Modeling semantic knowledge structures for creative problem solving: Studies on expressing concepts, categories, associations, goals and context. Knowledge-Based Systems. 2015;78:34-50
  45. Li H, Liu X, Jiao F, Doboli A, Doboli S. Innova: A cognitive architecture for computational innovation through robust divergence and its application for analog circuit design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 2017;37(10):1943-1956
  46. Arksey H, O’Malley L. Scoping studies: Towards a methodological framework. International Journal of Social Research Methodology. 2005;8(1):19-32
  47. Booth A, Sutton A, Clowes M, Martyn-St James M. Systematic Approaches to a Successful Literature Review. 3rd ed. London, UK: SAGE Publications; 2022
  48. Wanyama SB, McQuaid RW, Kittler M. Where you search determines what you find: The effects of bibliographic databases on systematic reviews. International Journal of Social Research Methodology. 2022;25(3):409-422
  49. Allot A, Lee K, Chen Q, Luo L, Lu Z. LitSuggest: A web-based system for literature recommendation and curation using machine learning. Nucleic Acids Research. 2021;49(W1):W352-W358
  50. Juang MC, Huang CC, Huang JL. Efficient algorithms for team formation with a leader in social networks. The Journal of Supercomputing. 2013;66:721-737
  51. Neshati M, Beigy H, Hiemstra D. Expert group formation using facility location analysis. Information Processing and Management. 2014;50(2):361-383
  52. Srivastava B, Koppel T, Paladi ST, Valluru SL, Sharma R, Bond O. ULTRA: A data-driven approach for recommending team formation in response to proposal calls. In: 2022 IEEE International Conference on Data Mining Workshops (ICDMW). Orlando, FL, USA: IEEE; 2022. pp. 1002-1009
  53. Curiac CD, Micea M, Plosca TR, Curiac DI, Doboli A. Dataset for bibliometric data-driven research team formation: Case of Politehnica University of Timisoara scholars for the interval 2010-2022. Data in Brief. 2024;53:110275. DOI: 10.1016/j.dib.2024.110275
