BlobSeer is a large-scale distributed storage service that addresses advanced data management requirements resulting from ever-increasing data sizes. It is centered around the idea of leveraging versioning for concurrent manipulation of binary large objects in order to efficiently exploit data-level parallelism and sustain a high throughput despite massively parallel data access.
Consider applications that process huge amounts of data distributed at very large scale. To facilitate data management under such conditions, a suitable approach is to organize data as a set of huge objects. Such objects (called BLOBs hereafter, for Binary Large OBjects) consist of long sequences of bytes representing unstructured data and may serve as a basis for transparent data sharing at large scale. A BLOB can typically reach sizes of up to 1 TB. Using a BLOB to represent data has two main advantages:
When many processes access the same file at once, contention can cause errors or slow access. BlobSeer introduces a distributed NameNode, so concurrent access to a single file performs better.
BlobSeer introduces BSFS (BlobSeer File System), which supports versioning, so each user updating a file works on a unique version of it. BSFS presents an API similar to that of HDFS and can be used by Hadoop, but versioning is not used in that case.
BlobSeer consists of a series of distributed communicating processes. The above figure illustrates the processes and the interactions between them.
Clients create, read, write and append data from/to BLOBs. A large number of concurrent clients is expected, and they may all access the same BLOB.
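The client operations listed above can be illustrated with a minimal in-memory sketch. The class `ToyBlob` and its method names are hypothetical and are not BlobSeer's client API; the sketch also keeps a full copy of the data per version for simplicity, whereas the real system stores chunks on data providers. It only shows the idea that every write or append yields a new snapshot version while older versions remain readable.

```python
class ToyBlob:
    """Hypothetical in-memory stand-in for a BLOB (not the BlobSeer API).

    Every update produces a new immutable snapshot; readers pick a version.
    """

    def __init__(self):
        # Version 0 is the empty BLOB produced by "create".
        self._snapshots = [b""]

    def append(self, data):
        """Add data at the end and return the new snapshot version."""
        self._snapshots.append(self._snapshots[-1] + data)
        return len(self._snapshots) - 1

    def write(self, offset, data):
        """Overwrite bytes starting at offset and return the new version."""
        old = self._snapshots[-1]
        if offset > len(old):
            raise ValueError("offset beyond end of BLOB")
        new = old[:offset] + data + old[offset + len(data):]
        self._snapshots.append(new)
        return len(self._snapshots) - 1

    def read(self, version, offset, size):
        """Read size bytes at offset from the given snapshot version."""
        return self._snapshots[version][offset:offset + size]


blob = ToyBlob()
v1 = blob.append(b"hello world")
v2 = blob.write(6, b"there")
# The older snapshot is unaffected by the later write.
print(blob.read(v1, 0, 11), blob.read(v2, 0, 11))
```

Because snapshots are never modified in place, a reader working on one version is not disturbed by concurrent writers producing later versions.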
Data (storage) providers physically store the chunks generated by appends and writes. Data providers may dynamically join and leave the system.
The provider manager keeps information about the available storage space and schedules the placement of newly generated chunks. It employs a configurable chunk distribution strategy to maximize the data distribution benefits with respect to the needs of the application.
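As a rough illustration of chunk placement, the sketch below tracks free space per provider and assigns each new chunk to the provider with the most free space. The "most free space first" policy, the class names and the method names are assumptions made for this example; the text only states that the distribution strategy is configurable, not which strategies exist.

```python
from dataclasses import dataclass, field


@dataclass
class Provider:
    """Hypothetical data provider, reduced to its free space in bytes."""
    name: str
    free_bytes: int


@dataclass
class ProviderManager:
    """Toy provider manager: tracks free space and places new chunks."""
    providers: dict = field(default_factory=dict)

    def join(self, name, free_bytes):
        self.providers[name] = Provider(name, free_bytes)

    def leave(self, name):
        self.providers.pop(name, None)

    def place(self, chunk_sizes):
        """Return one provider name per chunk (assumed policy: most free space)."""
        placement = []
        for size in chunk_sizes:
            candidates = [p for p in self.providers.values()
                          if p.free_bytes >= size]
            if not candidates:
                raise RuntimeError("no provider has enough free space")
            # Ties on free space are broken by name to stay deterministic.
            target = min(candidates, key=lambda p: (-p.free_bytes, p.name))
            target.free_bytes -= size
            placement.append(target.name)
        return placement


manager = ProviderManager()
manager.join("a", 100)
manager.join("b", 100)
manager.join("c", 50)
print(manager.place([40, 40, 40, 40]))  # ['a', 'b', 'a', 'b']
```

Swapping the `key` function is enough to try a different policy, which is one simple way to read "configurable" here.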
Metadata (storage) providers physically store the metadata that allows identifying the chunks that make up a snapshot version. A distributed metadata management scheme is employed to enhance concurrent access to metadata.
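The role of the metadata can be sketched as a mapping from (BLOB, version, chunk index) to a chunk identifier, spread over several metadata providers by hashing the key. This is a deliberately flat simplification under assumed names (`MetadataStore`, `publish`, `snapshot`); it does not reproduce BlobSeer's actual metadata structures. It only shows how a new snapshot version can reference unchanged chunks of the previous version instead of copying them.

```python
import hashlib


class MetadataStore:
    """Toy metadata layer: entries are hashed across several providers."""

    def __init__(self, n_providers):
        # Each dict stands in for one metadata provider.
        self.shards = [dict() for _ in range(n_providers)]

    def _shard(self, key):
        digest = hashlib.sha256(repr(key).encode()).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.shards)
        return self.shards[index]

    def put(self, blob_id, version, chunk_index, chunk_id):
        key = (blob_id, version, chunk_index)
        self._shard(key)[key] = chunk_id

    def get(self, blob_id, version, chunk_index):
        key = (blob_id, version, chunk_index)
        return self._shard(key)[key]


def publish(store, blob_id, version, n_chunks, updates):
    """Record metadata for a new version; unchanged chunks are shared."""
    for i in range(n_chunks):
        if i in updates:
            chunk_id = updates[i]
        else:
            # Reuse the chunk reference of the previous version.
            chunk_id = store.get(blob_id, version - 1, i)
        store.put(blob_id, version, i, chunk_id)


def snapshot(store, blob_id, version, n_chunks):
    """List the chunk identifiers that make up one snapshot version."""
    return [store.get(blob_id, version, i) for i in range(n_chunks)]


store = MetadataStore(n_providers=4)
publish(store, "blob-1", 1, 3, {0: "c0", 1: "c1", 2: "c2"})
publish(store, "blob-1", 2, 3, {1: "c1-new"})
print(snapshot(store, "blob-1", 2, 3))  # ['c0', 'c1-new', 'c2']
```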
The version manager is in charge of assigning new snapshot version numbers to writers and appenders and of revealing these new snapshots to readers. It does so in a way that offers the illusion of instant snapshot generation while satisfying the guarantees.
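One way to picture this behavior is the sketch below: writers obtain a version number up front, report completion later, and a snapshot becomes visible to readers only once all lower-numbered snapshots are complete. The in-order reveal rule, the class `VersionManager` and its method names are assumptions made for illustration, not a description of BlobSeer's actual protocol.

```python
import threading


class VersionManager:
    """Toy version manager: hands out version numbers, reveals them in order."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next = 1        # next version number to assign
        self._done = set()    # completed versions not yet revealed
        self._published = 0   # highest version visible to readers

    def assign(self):
        """Give a writer or appender its snapshot version number."""
        with self._lock:
            version = self._next
            self._next += 1
            return version

    def complete(self, version):
        """Mark a write as finished and reveal every version that is ready."""
        with self._lock:
            self._done.add(version)
            # A version is revealed only after all lower versions.
            while self._published + 1 in self._done:
                self._published += 1
                self._done.remove(self._published)

    def latest(self):
        """Return the most recent snapshot version readers may see."""
        with self._lock:
            return self._published


vm = VersionManager()
first, second = vm.assign(), vm.assign()
vm.complete(second)
print(vm.latest())  # 0: version 2 waits for version 1
vm.complete(first)
print(vm.latest())  # 2: both versions are revealed
```

Writers never wait for each other to obtain a number, while readers always observe a gap-free sequence of snapshots.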