Key-Value DB Load Balancer

Database, Parallel Computing, Load Balance


Proposal

Summary

We will create a load balancer that sits in front of a distributed key-value database back end. The database nodes used in this project will run Emerald DB.

Background

Key-value (KV) stores use the associative array as their fundamental data model. In this model, data is represented as a collection of key-value pairs, such that each possible key appears at most once in the collection. Key-value databases offer several desirable properties that relational databases typically do not, such as high write throughput, fast access by key, a flexible schema, and no single point of failure. Large companies such as Google and Yahoo rely heavily on key-value databases and often store several copies of each piece of data to obtain availability, reliability, and parallel access. For locality reasons, they may also move copies from one database to another. In this scenario, scheduling data accesses among all these copies becomes a real-world problem, and it is also the key to improving throughput.
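To make the idea of multiple, movable copies concrete, the following is a minimal sketch of a directory that records which nodes currently hold a copy of each key. The class name `ReplicaDirectory`, its methods, and the node identifiers are hypothetical and chosen for illustration only; they are not part of Emerald DB, and the sketch moves only the bookkeeping entry, not the data itself.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: tracks which nodes currently hold a copy of each key. */
class ReplicaDirectory {
    // key -> ids of the nodes that hold a copy of that key
    private final Map<String, Set<String>> locations = new ConcurrentHashMap<>();

    /** Records that nodeId holds a copy of key. */
    void addCopy(String key, String nodeId) {
        locations.computeIfAbsent(key, k -> ConcurrentHashMap.newKeySet()).add(nodeId);
    }

    /** Moves the record of a copy; returns false if the source node did not hold the key. */
    boolean moveCopy(String key, String from, String to) {
        Set<String> nodes = locations.get(key);
        if (nodes == null || !nodes.contains(from)) {
            return false;
        }
        nodes.add(to);      // add first so the key never has fewer recorded copies mid-move
        nodes.remove(from);
        return true;
    }

    /** Returns a read-only view of the nodes holding the key (empty if the key is unknown). */
    Set<String> nodesFor(String key) {
        Set<String> nodes = locations.get(key);
        return nodes == null
                ? Collections.<String>emptySet()
                : Collections.unmodifiableSet(nodes);
    }
}
```

A load balancer could consult `nodesFor(key)` on each request to learn which copies are candidates, and call `moveCopy` once a copy has actually been relocated.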

Challenge

The challenging part of this project is to correctly balance requests across the existing data copies. This requires real-time information about whether each database is overloaded or underloaded. We will also consider the scenario in which the locations of the data copies are not fixed, which means that we will move certain copies for load-balancing reasons. This poses new challenges, such as where to place the data copies and how to adjust our scheduling policy. One of the biggest challenges is ensuring data consistency across the different copies. Many algorithms already exist that are designed to handle the consistency problem, but selecting the one that best fits our system is still challenging.
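One simple way to balance requests across copies, shown here only as a sketch under stated assumptions, is to send each request to the copy whose node currently has the fewest requests in flight. The class `LeastLoadedSelector`, its methods, and the choice of in-flight request count as the load metric are all assumptions made for illustration; the actual policy and load signal for this project are still to be decided.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: route each request to the replica with the fewest in-flight requests. */
class LeastLoadedSelector {
    // node id -> number of requests currently in flight (assumed load metric)
    private final Map<String, AtomicInteger> inFlight = new ConcurrentHashMap<>();

    /** Registers a node so that it can be chosen. */
    void addNode(String nodeId) {
        inFlight.putIfAbsent(nodeId, new AtomicInteger());
    }

    /**
     * Picks the least-loaded registered node among those holding a copy and
     * counts the new request against it. Ties go to the earlier list entry.
     * The pick and the increment are not one atomic step, so under heavy
     * concurrency the choice is approximate.
     */
    String acquire(List<String> replicas) {
        String best = null;
        int bestLoad = Integer.MAX_VALUE;
        for (String node : replicas) {
            AtomicInteger counter = inFlight.get(node);
            if (counter == null) {
                continue; // unknown node, skip it
            }
            int load = counter.get();
            if (load < bestLoad) {
                bestLoad = load;
                best = node;
            }
        }
        if (best == null) {
            throw new IllegalStateException("no registered replica available");
        }
        inFlight.get(best).incrementAndGet();
        return best;
    }

    /** Call when a request finishes so the node's load drops again. */
    void release(String nodeId) {
        AtomicInteger counter = inFlight.get(nodeId);
        if (counter != null) {
            counter.decrementAndGet();
        }
    }
}
```

Each request would call `acquire` with the list of nodes holding the requested key and call `release` when the response returns. When a copy moves, only the list passed to `acquire` changes, which keeps the scheduling policy separate from placement decisions.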

Resources

Our system will run on machines with a Linux distribution installed. Initially, we can use virtual machines as the experimental environment. In the final stage, we hope to get access to 4 to 10 suitable machines at school to test the scalability and availability of our system.

Goals

The project has four main areas: implementing the load balancer, implementing a data movement mechanism, ensuring data consistency, and benchmarking under different data access patterns.

Platform

We plan to implement our load balancer in Java on Linux. The database we use, Emerald DB, is a document-oriented, MongoDB-like database. We chose it for the following reasons:
  • Emerald DB is a pure, lightweight key-value database, which makes it easy for us to integrate our own optimizations.
  • The database is easy to copy, so we can simulate access to multiple data copies.
  • Since it is very lightweight and we have permission to change its code, we can modify the database to provide any functionality our load balancer needs.

Schedule