
Improving the Relevance, Speed, and Computational Efficiency of Semantic Search through Database Indexing: A Review

Written By

Xiaofeng Yang

Reviewed: 19 June 2023 Published: 05 August 2023

DOI: 10.5772/intechopen.112232


From the Edited Volume

Optimization Algorithms - Classics and Recent Advances [Working Title]

Dr. Mykhaylo I. Andriychuk and Dr. Ali Sadollah


Abstract

In a large dataset, finding the top most relevant matches to fulfill search queries and similar applications becomes an increasing challenge. Facing this challenge, researchers and engineers have done extensive work in both the regime of algorithm design and that of system implementation to create practical solutions that fuel powerful tools. Techniques like caching, search indexes, and fuzzy indexing have been widely adopted in both SQL and XML data models and then provided to the database user, the application programmer, as a product. In this chapter, we review data indexing algorithms that efficiently compute semantic search query results, and discuss systems that implement them. We compare the pros and cons of each algorithm in terms of time and space complexity, as well as implementation difficulty. We also analyze trade-offs in real-world system design examples, in the hope of providing readers more insight when they want to efficiently find an answer through a semantic search query, or even build their own application using these databases.

Keywords

  • algorithms
  • application of algorithms
  • database systems
  • web development
  • relevance improvement

1. Introduction

Interacting with the web has become an essential part of our lives, and all of us, at some point, are web application users: we search for the spelling of a word, purchase groceries and clothes from online catalogs, stream video content on our phones or TVs, and check our bank statements through our online accounts. Behind all these applications, data storage and retrieval tasks are performed many times to power our queries and present accurate results in a timely manner. At this ever-increasing scale of web data, finding the top most relevant matches becomes increasingly challenging, both for application providers such as Wayfair and Nordstrom, and for infrastructure providers that sell computational resources and platforms, such as Amazon AWS. Facing this challenge, researchers and engineers have done extensive work to create practical solutions. Products like SQL and NoSQL databases, and techniques like caching and search indexes, have been widely adopted and provided to the application programmer as a product. Web application developers often want to know: out of all these technologies, which tech stack shall we pick to accomplish a task? How do we evaluate the strengths and weaknesses of each option, and pick the solution that best fits our situation?

This is a hard question and often requires some trial and error, along with collaboration that draws on expertise from multiple areas. However, there are perspectives that many developers and designers in the industry have come up with and proven effective against engineering and business challenges. We try to provide some tools and guidelines to our readers, to help build a framework for reasoning about the choice of these technologies.

We will first discuss the data models behind popular web applications, introduce how data is stored and used, and go through example products. We then discuss a few popular techniques used in each product to optimize system performance, their concepts and history, and how to evaluate their effectiveness. We then walk through a special optimization use case in web semantic search, and show how application engineers solve a real problem starting from analyzing the use case.


2. Data models

Data models define how application programmers reason about the problem, and choosing one is usually the first step in designing a web application that uses databases.

Grouping databases by their data models and design principles, SQL (Structured Query Language) [1] and NoSQL (Not Only SQL) [2] are two very high-level categories of database management systems.

2.1 SQL databases

The family of SQL databases has been around since the early days of web applications and is usually associated with the query language SQL. A typical SQL database is based on a relational data model, where data is organized into tables with predefined columns and rows. The SQL language is used to query these tables. Here is a simple example SQL query:

SELECT column1, column2, … FROM example_table;

What this query does is retrieve the data stored under the names column1 and column2 in a table called example_table. On top of that, we can specify filtering conditions on the data, the order in which the results are sorted, or even nest a query inside the result of another query, as sketched below.
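To make this concrete, here is a sketch of the same kind of query with a filtering condition, a sort order, and a nested subquery; the column names and values are purely illustrative assumptions, not part of any particular schema:

SELECT column1, column2
FROM example_table
WHERE column1 > 100        -- filtering condition
ORDER BY column2 DESC      -- sort order of the results
LIMIT 10;                  -- return only the first ten rows

SELECT column1
FROM (SELECT column1, column2
      FROM example_table
      WHERE column2 IS NOT NULL) AS filtered  -- a query over the result of another query
WHERE filtered.column1 > 100;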

SQL databases have a fixed schema that dictates the structure of the data, as well as the relationships between tables. Any change to that structure requires a modification to the database schema. This sometimes means additional migration work for an application programmer using a SQL database: they will have to back-fill the new database instance under the new schema, which can be time-consuming and error-prone [3].

The reason for having a fixed schema lies in the common use cases for which SQL databases were first designed: we want them to be ACID-compliant, which means that they guarantee the properties of atomicity, consistency, isolation, and durability for all transactions [4]. These properties form foundational assumptions the application programmer can rely on: they know that if part of their write to the database succeeded, the full body of the write must have succeeded (atomicity); the database will not give clients contradictory query answers (consistency); the behavior of one client interacting with the database will not be affected by other clients the database serves (isolation); and once data is persisted into the database, it will reliably stay there for later use (durability). These properties have helped application developers build software for bank transactions without worrying about ending up in a partial-result state, such as client A seeing the money transferred over while client B does not. However, these properties are usually built on carefully designed mechanisms such as locks, and can be slow to respond and expensive in terms of the computational resources required.

2.2 NoSQL databases

The word NoSQL was originally created to describe "not only SQL" data models [5]. NoSQL databases therefore cover a large variety of data models, such as key-value stores [6], document-based data models [7], column-family databases [8], and graph databases [9].

Many NoSQL databases have a flexible schema compared to SQL databases, allowing changes to be made to the data model without requiring modifications to the database schema. Adding an additional level of key-value nesting or adding a paragraph to a document can be performed online while the database is still serving incoming traffic. This property puts NoSQL databases at a great advantage for fast-changing, always-online web applications: in scenarios where we anticipate frequent data schema changes, adopting a NoSQL data model takes a significant maintenance burden away from application engineers.

As opposed to SQL databases, which use SQL, a standardized language for managing and manipulating relational data, as their query language, NoSQL databases use a variety of query languages that are specific to their data models. These languages may require different syntax and logic than SQL [10].

Why did people propose NoSQL databases? In what scenarios do they show advantage over SQL databases?

One answer comes from their scalability: SQL databases are typically vertically scalable, meaning that they can only handle larger amounts of data by adding more resources to a single server. This limits the scale a SQL database can handle, since the complexity and cost of building one single powerful computer grow quickly and are bounded by modern computer architecture. In contrast, NoSQL databases are usually horizontally scalable, meaning that they can handle larger amounts of data by adding more servers to a distributed system. This means that as the number of users and the workload grow, business owners using a NoSQL database can easily grow their capacity by buying more commodity hardware, such as regular computers available from vendors like Amazon, and adding it to their database deployment.

As we previously discussed, SQL databases are designed to be ACID-compliant. NoSQL databases make case-by-case sacrifices of some of these properties in exchange for improved scalability or performance.

As a result of all these different design considerations, SQL databases are usually the first choice for applications that perform sophisticated transactions on structured data, while NoSQL databases shine in applications that require scalability, flexibility, and handling of unstructured or semi-structured data.


3. Case study

In this section we look at real commercial database systems and their usage, to better understand the trade-offs made by engineers.

3.1 SQL database

One well-known SQL database is MySQL, a popular open-source relational database management system (RDBMS) owned by Oracle Corporation. Over the years, it has been widely used by application programmers as a backend database for both web and enterprise applications. MySQL is a popular choice because of its reliable performance, scalability, and ease of use, and it can be found under the hood of web-based applications such as online photo editing services, content management systems such as library software, and many small to medium-sized e-commerce websites.

Let us consider a simple web application that allows users to register, log in, and post comments on a website. MySQL can be used as the database management system to store and retrieve user information and comments.

3.1.1 User registration

  • When a user registers on the website, their registration information (such as username, email, and password) is collected through an HTML form in the client-side UI (user interface).

  • Upon receiving the form, the web application backend takes all this information and sends a SQL query to the database to insert a new row into the “users” table. The query might look like this:

    INSERT INTO users (username, email, password) VALUES
      ('ExampleUser', 'user@example.com', 'hashed_password');

where the password is hashed and not stored in plain text. Each user has only one password, so each username has only one record in the table. This means we can use the username column to identify a unique row in the table; we call this column the primary key of the table. A minimal sketch of the table definition follows this list.

  • After executing the above write query, the database stores this user’s registration information securely as a row in the users table.
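For concreteness, here is a minimal sketch of how the users table itself might be defined; the column types are illustrative assumptions rather than a schema prescribed in this chapter:

CREATE TABLE users (
  username VARCHAR(64)  NOT NULL,
  email    VARCHAR(255) NOT NULL,
  password VARCHAR(255) NOT NULL,  -- stores the hashed password, never plain text
  PRIMARY KEY (username)           -- one row per username, as discussed above
);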

3.1.2 User login

  • When a user tries to log in, our web application asks for their login credentials (username and password) and compares with our own record, to verify that this user is the same person who owns the account.

  • Upon receiving the HTML form, the application backend sends a SQL query to the database to check if the provided username and password match the row previously stored in the database. The query might look like this:

    SELECT * FROM users WHERE username='ExampleUser' AND password='hashed_password';

  • The database executes the above read query, and returns the result to the web application backend for comparison.

  • Based on the comparison result, the web application grants or denies access to the user.

3.1.3 Posting comments

  • After logging in, now the user can post a comment on our page.

  • Again through HTML form in the UI, the web application collects the comment content along with the logged in user’s ID.

  • It sends an SQL query to the database to insert a new row into the “comments” table. The query might look like this:

    INSERT INTO comments (user_name, content) VALUES
      ('ExampleUser', 'This is a great article!');

  • The MySQL database stores the comment along with the user’s username, linking the comment to the specific user in the users table. A column like this, which references a key in another table, is called a foreign key.
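A minimal sketch of the comments table might look as follows; the page_id column is an illustrative addition (so that the retrieval query in Section 3.1.4 has a column to filter on), and the exact types are assumptions:

CREATE TABLE comments (
  comment_id INT AUTO_INCREMENT PRIMARY KEY,
  user_name  VARCHAR(64) NOT NULL,
  page_id    INT NOT NULL,
  content    TEXT NOT NULL,
  FOREIGN KEY (user_name) REFERENCES users (username)  -- links each comment to its author
);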

3.1.4 Retrieving comments

When the web application displays comments on a page, it sends a SQL query to the database to fetch the comments for that page. The query might look like this:

SELECT * FROM comments WHERE page_id=1000001;

  • The database executes the read query and returns the comments to the web application backend.

  • The web application backend then sends the retrieved data to its front end, which displays the comments on the page.

In the above example, read and write operations are performed on the database. MySQL, as a common SQL database, provides these capabilities to store and manage user information and comments. The web application interacts with the database by sending SQL queries and retrieving the results, allowing users to register, log in, post comments, and view existing comments on the website.

3.2 NoSQL database: DynamoDB

We study Amazon DynamoDB as an example NoSQL database. Amazon DynamoDB is a fully managed, proprietary NoSQL database service offered by Amazon Web Services (AWS). It was first introduced by Amazon in 2012 and is designed to provide low-latency, scalable, and highly available database solutions for modern applications.

DynamoDB is a document-oriented, key-value store database, which means that data is stored as key-value pairs or as structured JSON documents.

3.2.1 Performance

As a highly scalable database, DynamoDB can handle petabyte-scale workloads with automatic scaling of throughput capacity, and it provides a rich set of data consistency levels, durability, and security features. However, it does not guarantee full ACID (Atomicity, Consistency, Isolation, Durability) properties, due to its simplified data model and focus on high performance and scalability.

In a key-value store, each key-value pair is typically treated as an independent entity, and transactions spanning multiple key-value pairs are not natively supported. Updates to individual key-value pairs are usually atomic, but there is no built-in mechanism to ensure atomicity across multiple key-value pairs.

DynamoDB also prioritizes high performance and scalability by allowing concurrent access to data without extensive locking mechanisms. This can lead to potential data conflicts or race conditions when multiple transactions access and modify the same keys simultaneously.

3.2.2 Application

DynamoDB is optimized for low-latency and high-throughput access patterns, making it suitable for applications that require fast and predictable performance, such as gaming, e-commerce, social media, and IoT applications.

DynamoDB is offered as a fully managed service, meaning that AWS takes care of the infrastructure, scaling, backups, and maintenance of the database, allowing developers to focus on building their applications.

A serverless application is an architecture and computing model where the developer focuses on writing code for specific functions or services, without the need to manage the underlying infrastructure or servers. In a serverless model, the cloud provider takes care of provisioning, scaling, and managing the servers, allowing developers to focus solely on writing and deploying code [11]. DynamoDB also integrates with other AWS services such as AWS Lambda, Amazon Kinesis, and Amazon S3 [12, 13], making it easy to build serverless applications with a variety of use cases.

DynamoDB also offers a global table feature, which allows for automatic replication of tables across multiple AWS regions. This enables low-latency access to data for users distributed worldwide, as well as fault tolerance when machines in some regions suffer from unavoidable disasters.

Integrating with AWS using DynamoDB offers benefits such as reduced infrastructure management, automatic scalability, cost optimization, and rapid development cycles. DynamoDB has become a popular database service among AWS customers due to its scalability, performance, and ease of use.


4. General optimization techniques

The reader may start to wonder: under the hood of these commercial databases, what algorithms are used to provide this reduced cost, improved speed, and reliability? The two most popular optimization techniques are indexing and caching, and they are widely adopted in traditional RDBMS (relational database management systems) in many variations. However, the simple and powerful main idea does not change: instead of computing the results every time a query is executed, pre-computed partial or full results are stored and served.

However, these approaches differ in what to store, where to store the partial results, and how to make sure that the stored (cached) data is up to date. Different algorithms have been proposed to find the best index to store for a certain use case, so that the following goals are achieved: (1) the result is not stale according to the application’s requirements and can be directly used by the web application without affecting its correctness; (2) the cost of storing such results is acceptable; (3) for a typical workload in the use case, storing these additional results is effective.

In this section, we will go over details of both indexing and caching and understand how they provide optimization.

4.1 Indexing

4.1.1 Concept and history

An index is a data structure that allows for faster data retrieval by enabling the database to locate and access specific rows more quickly. The concept of database indexing was first introduced in the early days of computer databases, as a means to improve data retrieval efficiency. The initial development of indexing dates back to the 1960s when the first database management systems (DBMS) were being developed.

One of the earliest known instances of indexing in databases can be traced back to IBM’s System R project in the 1970s [14]. System R, a research project at IBM’s San Jose Research Laboratory, focused on the development of a relational database system. System R introduced the concept of indexing to improve the performance of relational database queries.

In 1970, Edgar F. Codd published his groundbreaking paper titled “A Relational Model of Data for Large Shared Data Banks” [15] which laid the foundation for the relational database model. Although the paper did not explicitly mention indexing, it proposed the idea of a primary key as a means to uniquely identify records in a table, which essentially served as an implicit form of indexing.

Over the years, indexing techniques and algorithms have evolved, and different types of indexes, such as B-trees, hash indexes, and bitmap indexes, have been developed to suit various data access patterns and requirements.

It’s important to note that while the concept of indexing emerged in the 1960s and 1970s, the specific techniques and algorithms for indexing have continued to be refined and optimized, driven by advancements in computer hardware, storage systems, and database technologies.

4.1.2 Indexing in database

Here are some key reasons why indexes are created in a modern database:

  • Improved Query Performance. Indexes provide a way to locate data in a database quickly. By creating an index on one or more columns, the RDBMS can use the index to narrow down the search space and retrieve relevant data more efficiently. This results in faster query execution times, especially when working with large tables or when executing complex queries involving joins or filtering conditions. Instead of scanning the entire table, the database engine can use the index to locate specific rows or a range of rows, reducing disk I/O and improving data retrieval speed. Well-designed indexing can reduce the latency of such queries by orders of magnitude.

  • Efficient Data Sorting. Indexes can also be utilized to speed up sorting operations. When a query requires sorting the result set based on a specific column, an index on that column allows the RDBMS to retrieve the data in the desired order more efficiently, eliminating the need for a separate sorting step.

  • Constraint Enforcement. Indexes can enforce the uniqueness of values in a column or set of columns by creating unique indexes. Unique indexes prevent duplicate values from being inserted into the indexed columns, maintaining data integrity and supporting the enforcement of primary key or unique constraints.

It’s worth noting that while indexes improve query performance, they also have some trade-offs. Indexes require additional disk space to store the index data structure, and they can introduce overhead during data modification operations (such as inserts, updates, and deletes) because the indexes need to be updated along with the actual data. In a SQL database where atomicity is required, this usually means locking all the data to be modified, resulting in increased latency for write operations. It is essential to carefully design and maintain indexes based on the specific usage patterns, query workload, and data characteristics of the database to achieve optimal performance improvements while considering the associated costs and trade-offs.
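To make this concrete, here is a hedged sketch that continues the users/comments example from Section 3; the index names are hypothetical:

-- A secondary index on comments.page_id lets the retrieval query from
-- Section 3.1.4 use an index lookup instead of scanning the whole table.
CREATE INDEX idx_comments_page_id ON comments (page_id);

-- A unique index enforces a constraint, for example that each email
-- address can register only once, as discussed under constraint enforcement.
CREATE UNIQUE INDEX idx_users_email ON users (email);

-- The same read query now benefits from the index:
SELECT * FROM comments WHERE page_id = 1000001;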

4.1.3 Design of indexing

Here are some of the most popular forms of indexing, each suited to different data:

  • B-tree Indexing. B-trees are commonly used in databases to implement indexing. B-trees are balanced tree structures that allow for fast searches, insertions, and deletions of data. B-trees are optimized for disk-based storage and can be used to index large amounts of data efficiently.

  • Hash Indexing. Hash indexing uses a hash function to map the values of a column to a specific location in the index. This allows for fast lookups and is commonly used in in-memory databases.

  • Bitmap Indexing. Bitmap indexing is used to index columns that have a low cardinality (i.e., a small number of distinct values); we call such data sparse. In these cases, storing the data as bitmaps over the distinct values is a more compact representation than storing the data in its original format. A bitmap index uses a bitmap to represent each unique value in the column, and each bit in the bitmap represents a row in the table. This allows for fast querying of data, especially for columns that have a small number of distinct values.

  • Full-Text Indexing. Full-text indexing is used to index text-based data, such as documents or articles. Full-text indexing uses techniques such as tokenization and stemming to create an index of words or phrases that appear in the text. This allows for efficient searching of text-based data.
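As a small illustration of the last point, MySQL-style full-text indexing over the comments example from Section 3 might look like the following sketch (the index name and column names are assumptions carried over from the earlier sketches):

-- Build a full-text index over the comment text.
CREATE FULLTEXT INDEX idx_comments_content ON comments (content);

-- Keyword search served by the full-text index; MATCH ... AGAINST also
-- returns a relevance score that can be used for ranking.
SELECT user_name, content,
       MATCH(content) AGAINST('great article') AS relevance
FROM comments
WHERE MATCH(content) AGAINST('great article')
ORDER BY relevance DESC;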

4.2 Caching

4.2.1 Concept and history

The concept of caching is not limited to the database context: caching refers to the process of storing frequently accessed or computed data in a temporary storage location that allows for faster retrieval. In a database, this usually means storing full or intermediate query results in the database system’s memory (or another fast-access layer), allowing these results to be served directly for the same or similar queries in the future. Caching aims to improve system performance, reduce latency, and optimize resource utilization.

The cached data is typically kept closer to the requesting entity (such as an application or a user) to minimize the latency associated with accessing the original source or performing expensive computations repeatedly. We can apply caching on the client side of the application, which is closer than the application backend; we can also apply caching on the first server that serves client traffic, which is closer than the database server.

We can cache various types of data, including database query results, web pages, images, API responses, computed values, and file contents. The choice of what to cache depends on the specific use case and on which data is accessed frequently or requires significant processing time. If the application requires frequent access to the same file, we should consider caching that file’s contents. If multiple operations in the application rely on the same computed value, we should consider caching that value for repeated use.

4.2.2 Caching in databases

When talking about caching in databases, we usually mean caching query results or partial results for repeated use in a typical workload where the application queries the database. Caching in the database can be categorized into buffer caching and query result caching:

  • Buffer Caching. Buffer caching, also known as disk block caching or page caching, was introduced to speed up disk I/O operations in database systems. It involved caching disk blocks or database pages in memory to reduce the need for physical disk reads and writes. This technique was widely adopted in database management systems (DBMS) in the 1970s and 1980s.

  • Query Result Caching. The concept of query result caching, where the results of frequently executed queries are cached for faster retrieval, gained prominence in the early 2000s with the development of advanced database systems. It became more prevalent as databases evolved to handle larger data volumes and more complex queries.

Modern database management systems, both relational and non-relational, have incorporated various forms of caching to optimize performance. Caching techniques include result set caching, query plan caching, and page buffer caching. These caching mechanisms leverage the memory resources of the database server or distributed cache clusters to store frequently accessed data and query results, minimizing the need for costly disk I/O operations and reducing query execution times.
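As a simple illustration of query result caching, an application can maintain its own summary table whose rows are precomputed results of an expensive aggregation. This is only a sketch of the idea (production systems typically cache inside the DBMS or in an external cache layer), and the table and column names are hypothetical:

-- Cache table holding one precomputed result per page.
CREATE TABLE comment_counts_cache (
  page_id       INT PRIMARY KEY,
  comment_count INT NOT NULL,
  refreshed_at  TIMESTAMP NOT NULL
);

-- Periodically refresh the cached results of the expensive aggregation.
REPLACE INTO comment_counts_cache (page_id, comment_count, refreshed_at)
SELECT page_id, COUNT(*), NOW()
FROM comments
GROUP BY page_id;

-- Reads hit the small cache table instead of re-running the aggregation.
SELECT comment_count FROM comment_counts_cache WHERE page_id = 1000001;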

4.2.3 Design of caching

Caching effectiveness depends on factors such as cache size, cache eviction policies, and the access patterns of the data. An optimized caching strategy tailored to the specific use case and data access patterns is crucial to maximizing the benefits of caching and achieving optimal performance improvements.

Most of the caching effectiveness depends on the spatial and temporal locality of the data access patterns. For example:

  • The LRU (Least Recently Used) eviction policy exploits the temporal locality of the access pattern by removing the least recently used item to make room for new entries. To improve data structure efficiency, a Clock algorithm is often used to approximate LRU with less memory overhead.

  • The Low Inter-reference Recency Set (LIRS) policy improves on LRU by better handling scan workloads, so that when the access pattern starts scanning the data set, the scan does not evict the “real” hot elements from the cache (scan resistance).

  • Pre-fetching is another caching strategy that proactively loads the data near the accessed data set. It exploits the fact that data items close to each other are more likely to be accessed together; hence after pre-fetching, the data to be accessed next is more likely to already be in the cache.

4.3 Metrics

Metrics provide useful tools for engineers to understand the current performance of a system, reason about its strengths and weaknesses, and then find a path toward more optimal designs. Latency, throughput, and availability are three important metrics when we evaluate a database system.

4.3.1 Latency

In a database system, latency refers to the time delay or the amount of time it takes for a query or operation to be executed and for the result to be returned. It is a measure of the response time or the delay experienced by the user or application when interacting with the database.

Latency in a database system can be caused by various factors, including:

  • Network. The time taken for data to travel between the client and the database server over the network can contribute to latency. Network latency depends on factors such as the distance between the client and server, network congestion, and the quality of the network infrastructure.

  • Disk I/O. When accessing data stored on disk, the time required to read or write data from or to the storage medium can impact latency. Disk I/O latency depends on factors such as disk speed, seek times, and the number of concurrent I/O operations.

  • Query Processing and Execution. The time taken by the database server to process and execute a query can contribute to latency. This includes parsing the query, optimizing the query execution plan, accessing the necessary data from storage, and performing any required computations or joins.

To measure latency, many database systems offer monitoring and performance analysis tools that provide insights into query execution times, wait times, I/O statistics, and other performance-related metrics. These tools can help identify bottlenecks and measure latency at various stages of query execution.

Caching reduces latency by storing frequently accessed data closer to the requesting entity. When data is already cached, it can be quickly retrieved from the cache, eliminating the need to access the original data source, which may involve disk I/O, network latency, or complex computations. By minimizing the time required to retrieve data, caching reduces the overall response time and improves latency.

4.3.2 Throughput

Throughput refers to the amount of work or the number of operations that can be processed within a given time period. It represents the system’s capacity to handle a certain volume of requests or transactions per unit of time.

Throughput is often measured in terms of transactions per second (TPS), queries per second (QPS), or operations per second (OPS). It reflects the system’s ability to efficiently process and execute database operations, such as reads, writes, updates, and queries.

To measure throughput in a database system, we can consider the following approaches:

  • Workload Testing: Perform workload testing by simulating a realistic or expected workload on the database system. This involves executing a set of representative queries or operations, with a specified concurrency level, and measuring the number of successful transactions or operations completed within a specific time frame. Workload testing tools or frameworks can help automate and streamline this process.

  • Performance Monitoring and Alerting: Use performance monitoring tools or features provided by the database system to collect metrics related to throughput. These tools often provide information about the number of queries executed, transactions processed, or operations completed per unit of time. By monitoring and analyzing these metrics, you can gain insights into the system’s throughput capacity and performance.

  • Load Testing: Conduct load testing by subjecting the database system to a higher-than-usual workload, gradually increasing the load to observe how the system handles it. Load testing tools can generate a high volume of concurrent requests or transactions to stress-test the system and measure its throughput under varying load conditions.

Caching enhances throughput by reducing the load on the underlying data storage or computation resources. When data is cached, subsequent requests for the same data can be served directly from the cache, without putting additional strain on the database system. By avoiding redundant and expensive operations, such as disk I/O or complex computations, caching frees up resources to handle other requests and increases the overall throughput capacity of the system.

Another way to look at this is, caching offloads the database system by reducing the number of queries or requests that need to be processed. As frequently accessed data is stored in the cache, the number of requests hitting the database decreases. This reduction in the database load allows the system to handle a higher volume of requests and improves the overall scalability of the database.

4.3.3 Availability

Database availability refers to the ability of a database system to remain accessible and operational, providing uninterrupted access to data and services. It ensures that users or applications can reliably access and interact with the database system whenever they need to, without experiencing significant downtime or disruptions.

High availability is a critical aspect of database systems, particularly in scenarios where downtime can lead to business disruptions, financial losses, or degraded user experiences. Database availability is often expressed as a percentage of uptime, such as “five nines availability” (99.999%), indicating a highly available system with very minimal downtime. Achieving high availability requires careful planning, robust infrastructure, redundancy, fault-tolerant design, and continuous monitoring and maintenance practices.

Key factors that contribute to database availability include:

  • System Redundancy: Implementing redundant components, such as multiple database servers, network connections, and storage systems, can help mitigate the impact of hardware failures or other infrastructure issues. Redundancy ensures that if one component fails, the system can seamlessly switch to a backup or standby component to maintain availability.

  • Fault Tolerance: Building fault-tolerant mechanisms into the database system helps it withstand and recover from failures gracefully. This includes features like automatic failover, replication, and clustering, which enable the system to continue operating even if individual components or nodes encounter issues.

  • Disaster Recovery: Having robust disaster recovery plans and mechanisms in place is essential for maintaining database availability during catastrophic events or major disruptions. This involves implementing off-site backups, replication to remote locations, and procedures for recovering the system and data in case of a disaster.

  • Monitoring and Alerting: Proactive monitoring and alerting systems help identify potential issues or performance degradation in real-time. Monitoring tools can track various metrics related to system health, performance, and availability, enabling administrators to detect and address issues promptly to minimize downtime.

  • Maintenance and Upgrades: Database systems require regular maintenance, patching, and upgrades. Ensuring availability during these activities is crucial. Techniques such as rolling upgrades, where components are upgraded one at a time without interrupting overall system operation, can help maintain availability during maintenance.

Caching may also contribute to availability by providing a fallback option when the original data source is temporarily unavailable or experiencing performance issues. If the cache holds a copy of the requested data, it can continue serving that data even if the underlying database is unavailable. This helps ensure uninterrupted access to critical data and improves the overall availability of the system.


5. Optimization in special applications - most relevant search

5.1 Turning use case to requirements

In the sections above, we have built our basic tools for understanding database performance. However, in the ever-growing problem space of web applications, new use cases derive from the complex, ever-changing ways humans interact with computers in modern applications. Imagine the scenario of a web user performing a semantic search on the internet. Instead of asking performance questions using the system language we have just acquired, they may inspect their own user experience and ask the following questions:

  • Can I get the results faster if I do not mind out-of-date results as long as they are accurate?

  • What if I do not mind if a small percentage of results are not accurate?

  • Can I review the results before committing to running a full query?

  • Can I get the most relevant results and then decide if I want to continue the query?

Unlike well-defined metrics, these are case-specific performance requirements raised by a user. Application developers have to consider these requirements before they proceed to picking the technology to build a system. We call this step a use case analysis, which usually results in a product requirement.

A use case is a description or representation of a specific interaction or scenario that demonstrates how a system or software is used to accomplish a particular goal or provide value to its users. It outlines the steps, actions, and interactions between actors (users or external systems) and the system being analyzed or designed.

Use case analysis is commonly required in software development, requirements engineering, and system analysis to capture and communicate functional requirements and system behavior. It helps define and understand the various ways in which users or external entities interact with the system to achieve specific objectives.

To better meet the requirements for this use case, we need to first translate these questions the user raised into languages of system performance, build an abstract model to represent the goal we want to optimize, and address these problems using algorithmic and system technologies.

In this use case, we notice that, while requiring query accuracy, the user mainly cares about (1) time until the first result, and (2) time until the first few results. This means that to choose the data model and design the algorithms, we need to evaluate our solution using these two metrics and decide which way to go.

Through this example of top relevant semantic search, we will learn how to apply the models and techniques we have learned to solve a real-world problem and improve our web application’s user experience. Indexing and caching are still powerful tools, and we still care about system latency and throughput; we just need to adjust their definitions for this special use case. We will also see new trade-offs made to meet these requirements.

5.2 Representation and algorithms

5.2.1 RDF data model

Web data can be represented using the Resource Description Framework (RDF) model. RDF is a standard model for representing data on the web in a structured and machine-readable format. It provides a way to describe resources, their properties, and the relationships between them.

RDF represents data as triples, which consist of subject-predicate-object statements. The subject represents the resource being described, the predicate represents the property or attribute of the resource, and the object represents the value or another resource that the property refers to. This triple format allows for flexible and extensible representation of data.

Web data can be converted into RDF by mapping the existing data elements to RDF entities and properties. For example, consider a web page that contains information about books, including titles, authors, and publication dates. This data can be represented in RDF using triples, such as:

<Book1> <Title> “Introduction to RDF”.
<Book1> <Author> “Xiaofeng Yang”.
<Book1> <PublicationDate> “2023-01-01”.

In this example, <Book1> represents the resource (book), <Title>, <Author>, and <PublicationDate> represent the properties, and the values are represented as literals.

RDF provides a uniform way to integrate and link data from different sources on the web. It enables the creation of knowledge graphs and supports semantic querying, reasoning, and inference. By representing web data in RDF, it becomes easier to combine, share, and reason over heterogeneous data from various domains.

Representing web data using the RDF model allows for more structured and interconnected data on the web, facilitating data integration, semantic search, and knowledge discovery.

5.2.2 Search for top results

Here are a few approaches to searching for top results in RDF:

  • SPARQL Queries. SPARQL (SPARQL Protocol and RDF Query Language) is a query language specifically designed for querying RDF data. SPARQL allows you to construct queries to retrieve specific information from RDF datasets. By formulating SPARQL queries with appropriate filters, sorting, and limit clauses, you can search for top results based on criteria such as property values, resource types, or other relevant parameters.

  • Full-Text Search. If the RDF data includes textual content, we can employ full-text search techniques to search for top results by extracting the relevant textual information from the RDF triples and using full-text search engines or libraries, such as Apache Lucene [16] or Elasticsearch, to perform efficient keyword-based searches. These search engines can return ranked results based on relevance scores or other ranking algorithms.

  • Semantic Search. RDF allows for the representation of semantic relationships between resources. Semantic search techniques take into account the meaning and context of the data. This can involve utilizing ontologies or vocabularies to understand the relationships between resources and to infer connections. Semantic search can help retrieve top results by considering the semantics and relationships embedded within the RDF data.

Top-k keyword search algorithms for XML databases [17] combine semantic pruning based on document structure encoding with top-k join algorithms from relational databases. The main challenge lies in dealing with query semantics based on least common ancestors. RDF is a flexible and extensible way to represent information about World Wide Web resources.

Recent work on graph pattern matching in RDF databases has resulted in several different approaches to database design. However, as in the case of top-k query evaluation, it appears that more focus has been placed on scalability issues, such as replication, parallelization, and distribution of workloads [18, 19], as RDF datasets are often too large for a single machine. It is well known that SPARQL is descriptive enough to capture graph pattern matching queries (so-called basic graph patterns), and these queries are typically then decomposed into combinations of database primitive operations such as joins, unions, difference, etc.
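To see what this decomposition looks like in practice, here is a hedged sketch in SQL of a triples table and a basic graph pattern evaluated as a self-join; the table layout and names are illustrative assumptions, not the schema of any particular RDF store:

CREATE TABLE triples (
  subject   VARCHAR(255),
  predicate VARCHAR(255),
  object    VARCHAR(255)
);

-- Two-edge pattern: books with an author and a publication date after 2020,
-- expressed as a self-join over the triples table.
SELECT t1.subject AS book, t1.object AS author
FROM triples t1
JOIN triples t2 ON t1.subject = t2.subject
WHERE t1.predicate = 'Author'
  AND t2.predicate = 'PublicationDate'
  AND t2.object >= '2021-01-01';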

5.2.3 Subgraph isomorphism problem

Searches on RDF require efficient algorithms that find subgraphs of the data isomorphic to the graph representation of the query. Subgraph isomorphism is an NP-hard problem [20] in the size of the subgraph, and state-of-the-art algorithms are not practical for large graphs. Lee et al. [21] empirically compare the performance of several state-of-the-art subgraph isomorphism algorithms, including the generic subgraph isomorphism algorithm [22], the Ullmann algorithm [23], VF2 [24], QuickSI [25], GADDI [26], and GraphQL [27]. They test on the real-world datasets AIDS, NASA, Yeast, and Human, covering a spectrum of relatively small graphs (with fewer than 1000 vertices). Modern social networks easily exceed that size by orders of magnitude, and the exact subgraph isomorphism problem remains intractable for larger networks when label constraints and top-k are not fully exploited. Hence, to the best of our knowledge, none of these classic precise pattern matching algorithms can be used for our target application.

5.2.4 K-shortest simple paths

Similar to the subgraph isomorphism problem, the k-shortest paths problem is a natural and long-studied generalization of the shortest path problem, in which not one but several paths in increasing order of length are sought. The additional constraint of “simple” paths requires that the paths be without loops. This problem has been researched for directed graphs [28], undirected graphs [29], approximation algorithms [30], and algorithm engineering [31].

5.2.5 Anytime top-k searches

Finding the most important patterns plays a crucial role in web semantic search applications. What can we do to avoid running these NP-hard algorithms, whose cost can easily grow exponentially with the query size? One option is to approximate the query results. However, remember that in our use case we want to explore faster retrieval of top results while maintaining the complete correctness of query results. This leads to the path of employing new computational models that allow us to not compute all results.

Instead of paying the high cost of computing all results and then sorting them (we refer to that as full enumeration), the data analyst should be able to see the most relevant result “quickly”, closely followed by the second-ranked, third-ranked, and so on. This way they can stop evaluation as soon as they are satisfied, or pause and restart as often as desired. This means fewer computational resources and faster exploratory analysis. The property of being able to see the top-ranked results quickly also allows the analyst to try out many different ranking measures and observe their effect on the output. This type of query is termed an any-k query, for it returns the k top-ranked results without requiring the analyst to specify k in advance.

In traditional top-k algorithms, setting how many results to compute is a challenge: setting k too large wastes resources on computing irrelevant results, and setting it too low requires a restart with a larger k, again wasting resources. Some top-k algorithms can be extended to support incremental computation beyond the specified k.

Researchers focus on any-k functionality for conjunctive queries, because they represent a sweet spot in terms of computational tractability and wide applicability [32]. The class of conjunctive queries includes subgraph homomorphism as well as natural joins with selections and projections, also known as SPJ queries. Joins are an essential building block of queries in relational databases. Exciting recent developments for joins between more than two relations and general join hypergraphs have renewed interest in their efficient evaluation. We consider the same type of join queries as this recent work.

In a database, a join is an operation that combines rows from two or more tables based on a related column between them. It allows us to retrieve data from multiple tables simultaneously by establishing a connection or relationship between them.

Here is an example join in SQL:

SELECT column1, column2, …
FROM table1
INNER JOIN table2 ON table1.columnX = table2.columnY;

Joins are fundamental to relational databases and enable us to query and retrieve data that spans across multiple tables, providing a way to consolidate related information into a single result set. The most common type of join is an inner join, which returns only the rows that have matching values in both tables based on the specified join condition.
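To connect this back to ranked retrieval, here is a sketch of a ranked join over the hypothetical triples table sketched in Section 5.2.2: a conventional engine typically computes the full join result and sorts it before returning the top rows, which is exactly the full enumeration that any-k algorithms aim to avoid.

SELECT t1.subject AS book, t1.object AS author, t2.object AS published
FROM triples t1
JOIN triples t2 ON t1.subject = t2.subject
WHERE t1.predicate = 'Author'
  AND t2.predicate = 'PublicationDate'
ORDER BY t2.object DESC  -- ranking function
LIMIT 10;                -- only the top-ranked results are actually needed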

In order to evaluate the performance of an any-k algorithm, let us look at the following three measures:

  1. Time-to-first (TTF): It returns the top-ranked result “quickly,”

  2. Time-to-next (TTN): after returning a result, it returns the next result “quickly,” and

  3. Time-to-last (TTL): it takes “not significantly longer” to return all answers than the best-known algorithms for full query computation and ranking.

Once we have models to reason about algorithm performance, we can design solutions that meet user requirements and measure them. We do not go into the theoretical foundations of these optimizations here; interested readers who want to know more about how these results are achieved can find full proofs and performance characterizations in [32]. The researchers propose an algorithm that is proven to be optimal in the worst case over RDF data input. This means that by implementing the algorithm in the RDF database, application engineers are guaranteed that time-to-first (TTF), time-to-next (TTN), and time-to-last (TTL) are significantly lower than, or comparable to, those of all other algorithms.


6. Summary

In this special application, we went through the use case of interactive semantic searches on the web and discussed the software engineering steps that happen before the design and implementation phase, i.e., user study, key use case analysis, and modeling the use case into an abstract problem.

Real-world applications come in all shapes and flavors, but we hope that the reader has gained some insight into how to approach a problem by disassembling it into the smaller pieces of (1) use case study, (2) product requirement design, (3) system design and evaluation, and then (4) deployment and maintenance. Once we have a formulation of the engineering problem at hand, there are many tools and technologies to choose from to solve the well-formed sub-problem in each step. With the user requirements in mind, we can evaluate all these tools from methodological perspectives and find the solutions that work best for our use case.

Enjoy the journey!

References

  1. Emerson SL, Darnovsky M, Bowman J. The Practical SQL Handbook: Using Structured Query Language. Boston, USA: Addison-Wesley Longman Publishing Co., Inc; 1989
  2. George S. NoSQL–not only SQL. International Journal of Enterprise Computing and Business Systems. 2013;2(2):2-5
  3. Maule A, Emmerich W, Rosenblum DS. Impact analysis of database schema changes. In: Proceedings of the 30th International Conference on Software Engineering. New York, United States: Association for Computing Machinery (ACM); 2008. pp. 451-460
  4. Stonebraker M. SQL databases v. NoSQL databases. Communications of the ACM. 2010;53(4):10-11
  5. Han J, Haihong E, Le G, Jian D. Survey on NoSQL database. In: 2011 6th International Conference on Pervasive Computing and Applications. Port Elizabeth, South Africa: IEEE; 2011. pp. 363-366
  6. Lakshman A, Malik P. Cassandra: A decentralized structured storage system. ACM SIGOPS Operating Systems Review. 2010;44(2):35-40
  7. Banker K, Garrett D, Bakkum P, Verch S. MongoDB in Action: Covers MongoDB Version 3.0. New York, United States: Simon and Schuster; 2016
  8. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, et al. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS). 2008;26(2):1-26
  9. Miller JJ. Graph database applications and concepts with Neo4j. In: Proceedings of the Southern Association for Information Systems Conference. Vol. 2324. Atlanta, GA, USA: Association for Information Science and Technology; 2013
  10. Bach M, Werner A. Standardization of NoSQL database languages. In: Beyond Databases, Architectures, and Structures. 10th International Conference, BDAS 2014, Ustron, Poland, May 27–30, 2014. Proceedings 10. Ustron, Poland: Springer; 2014. pp. 50-60
  11. McGrath G, Brenner PR. Serverless computing: Design, implementation, and performance. In: 2017 IEEE 37th International Conference on Distributed Computing Systems Workshops (ICDCSW). Atlanta, GA, USA: IEEE; 2017. pp. 405-410
  12. Varia J, Mathew S, et al. Overview of Amazon Web Services. Amazon Web Services. 2014;105:5-11
  13. Baron J, Kotecha S. Storage options in the AWS cloud. In: Amazon Web Services. Washington DC: Tech. Rep; 2013
  14. Astrahan MM, Blasgen MW, Chamberlin DD, Eswaran KP, Gray JN, Griffiths PP, et al. System R: Relational approach to database management. ACM Transactions on Database Systems (TODS). 1976;1(2):97-137
  15. Codd EF. A relational model of data for large shared data banks. Communications of the ACM. 1983;26(1):64-69
  16. Białecki A, Muir R, Ingersoll G, Imagination L. Apache Lucene 4. In: SIGIR 2012 Workshop on Open Source Information Retrieval. Vol. 17. New York, United States: Association for Computing Machinery (ACM); 2012
  17. Chen LJ, Papakonstantinou Y. Supporting top-k keyword search in XML databases. In: Proc. ICDE. Piscataway, NJ, USA: IEEE; 2010. pp. 689-700
  18. Huang J, Abadi DJ, Ren K. Scalable SPARQL querying of large RDF graphs. PVLDB. 2011;4(11):1123-1134
  19. Hose K, Schenkel R. WARP: Workload-aware replication and partitioning for RDF. In: Proc. ICDE Workshops. Piscataway, NJ, USA: IEEE; 2013. pp. 1-6
  20. Lubiw A. Some NP-complete problems similar to graph isomorphism. SIAM Journal on Computing. 1981;10(1):11-21
  21. Lee J, Han W-S, Kasperovics R, Lee J-H. An in-depth comparison of subgraph isomorphism algorithms in graph databases. PVLDB. 2012;6(2):133-144
  22. Cordella LP, Foggia P, Sansone C, Vento M. A (sub)graph isomorphism algorithm for matching large graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2004;26(10):1367-1372
  23. Ullmann JR. An algorithm for subgraph isomorphism. JACM. 1976;23(1):31-42
  24. Cordella LP, Foggia P, Sansone C, Vento M. An improved algorithm for matching large graphs. In: Proc. IAPR-TC15 Workshop on Graph-Based Representations in Pattern Recognition. Piscataway, NJ, USA: IEEE; 2001. pp. 149-159
  25. Shang H, Zhang Y, Lin X, Yu JX. Taming verification hardness: An efficient algorithm for testing subgraph isomorphism. PVLDB. 2008;1(1):364-375
  26. Zhang S, Li S, Yang J. GADDI: Distance index based subgraph matching in biological networks. In: Proc. EDBT. New York, United States: Association for Computing Machinery (ACM); 2009. pp. 192-203
  27. He H, Singh AK. Graphs-at-a-time: Query language and access methods for graph databases. In: Proc. ACM SIGMOD. New York, United States: Association for Computing Machinery (ACM); 2008. pp. 405-418
  28. Yen JY. Finding the k shortest loopless paths in a network. Management Science. 1971;17(11):712-716
  29. Katoh N, Ibaraki T, Mine H. An efficient algorithm for k shortest simple paths. Networks. 1982;12(4):411-427
  30. Roditty L. On the k-simple shortest paths problem in weighted directed graphs. In: Proc. SODA. New York, United States: Association for Computing Machinery (ACM); 2007. pp. 920-928
  31. Hershberger J, Maxel M, Suri S. Finding the k shortest simple paths: A new algorithm and its implementation. ACM Transactions on Algorithms. 2007;3(4):45-es
  32. Tziavelis N, Ajwani D, Gatterbauer W, Riedewald M, Yang X. Optimal algorithms for ranked enumeration of answers to full conjunctive queries. In: Proceedings of the VLDB Endowment. Vol. 13. 2020. p. 1582
