
What is BlobSeer

BlobSeer is a large-scale distributed storage service that addresses advanced data management requirements resulting from ever-increasing data sizes. It is centered around the idea of leveraging versioning for concurrent manipulation of binary large objects in order to efficiently exploit data-level parallelism and sustain a high throughput despite massively parallel data access.

Features include:

  • storage of binary large objects (BLOBs) that reach the order of TB
  • fine-grained access (e.g., in the order of MB)
  • versioning: each write generates a new fully independent snapshot of the blob; all past snapshots are accessible
  • data and metadata decentralization
  • high throughput under heavy access concurrency in any combination: read/read, read/write, write/write.
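
The versioning feature listed above can be illustrated with a toy in-memory model. This is only a sketch of the semantics (each write yields a new snapshot, and all past snapshots stay readable); the class `VersionedBlob` and its methods are hypothetical names and are not the BlobSeer API. For simplicity it stores a full copy per snapshot, which a real storage service would avoid.

```python
class VersionedBlob:
    """Toy model (hypothetical, not the BlobSeer API): every write yields a new snapshot."""

    def __init__(self):
        # Version 0 is the empty blob.
        self._snapshots = [b""]

    def write(self, offset, data):
        """Apply `data` at `offset` on top of the latest snapshot; return the new version."""
        base = self._snapshots[-1]
        # Zero-pad if the write starts beyond the current end of the blob.
        padded = base.ljust(offset, b"\x00")
        new = padded[:offset] + data + padded[offset + len(data):]
        self._snapshots.append(new)
        return len(self._snapshots) - 1

    def read(self, version, offset, size):
        """Read `size` bytes at `offset` from the given snapshot version."""
        return self._snapshots[version][offset:offset + size]


blob = VersionedBlob()
v1 = blob.write(0, b"hello world")
v2 = blob.write(6, b"there")

# The older snapshot is unaffected by the later write.
old = blob.read(v1, 0, 11)
new = blob.read(v2, 0, 11)
```

Here `old` still holds the content as of the first write, while `new` reflects the second write applied on top of it.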

Why BlobSeer

Let's consider applications that process huge amounts of data distributed at very large scale. To facilitate data management in such conditions, a suitable approach is to organize data as a set of huge objects. Such objects (called BLOBs hereafter, for Binary Large OBjects) consist of long sequences of bytes representing unstructured data and may serve as a basis for transparent data sharing at large scale. A BLOB can typically reach sizes of up to 1 TB. Using a BLOB to represent data has two main advantages:

  • Scalability. Applications that deal with fast-growing datasets that easily reach the order of TB and beyond can scale better, because maintaining a small set of huge BLOBs comprising billions of small, KB-sized application-level objects is much more feasible than managing billions of small, KB-sized files directly. Even if there were a distributed file system that supported access to such small files transparently and efficiently, the simple mapping of application-level objects to file names can incur a prohibitively high overhead compared to the solution where the objects are stored in the same BLOB and only their offsets need to be maintained.
  • Transparency. A data-management system relying on globally shared BLOBs, uniquely identified in the system through global ids, facilitates application development by freeing the developer from the burden of managing data locations and data transfers explicitly in the application.
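
The scalability argument above (many small objects stored in one BLOB, with only their offsets maintained) can be sketched as follows. This is a minimal illustration under stated assumptions: `PackedBlob`, `put` and `get` are hypothetical names, and a `bytearray` stands in for the BLOB.

```python
class PackedBlob:
    """Toy model (hypothetical): small objects packed into one byte sequence."""

    def __init__(self):
        self._data = bytearray()   # stands in for the BLOB
        self._index = {}           # object key -> (offset, length)

    def put(self, key, payload):
        """Append `payload` to the BLOB and remember where it lives."""
        offset = len(self._data)
        self._data += payload
        self._index[key] = (offset, len(payload))
        return offset

    def get(self, key):
        """Fetch an object back using only its recorded offset and length."""
        offset, length = self._index[key]
        return bytes(self._data[offset:offset + length])


packed = PackedBlob()
first_offset = packed.put("sensor-1", b"abc")
second_offset = packed.put("sensor-2", b"defg")
```

The application keeps one small index entry per object instead of one file name per object, which is the overhead the text argues against.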

BlobSeer and Hadoop

  • Hadoop relies on a single NameNode to locate data in HDFS.

When many processes access the same file, this single NameNode can become a bottleneck and slow access down. BlobSeer distributes the NameNode's role, which improves concurrent access to a single file.

  • Hadoop does not support versioning or concurrent updates to the same file.

BlobSeer introduces BSFS (BlobSeer File System), which supports versioning, so each user updating a file works on its own version of that file. BSFS exposes an API similar to that of HDFS and can be used by Hadoop, but versioning is not used in this case.

Architecture

BlobSeer consists of a series of distributed communicating processes. The above figure illustrates the processes and the interactions between them.

Clients create, read, write and append data from/to BLOBs. A large number of concurrent clients is expected, and they may all access the same BLOB.

Data (storage) providers physically store the chunks generated by appends and writes. New data providers may dynamically join and leave the system.

The provider manager keeps information about the available storage space and schedules the placement of newly generated chunks. It employs a configurable chunk distribution strategy to maximize the data distribution benefits with respect to the needs of the application.
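
As an illustration of what a chunk distribution strategy might look like, the sketch below splits a write into fixed-size chunks and assigns them to providers in round-robin order. Round-robin is an assumption chosen for simplicity, not necessarily a strategy BlobSeer ships; the function names and provider labels are hypothetical.

```python
from itertools import cycle


def split_into_chunks(data, chunk_size):
    """Cut `data` into consecutive chunks of at most `chunk_size` bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_round_robin(chunks, providers):
    """Assign chunks to providers in turn (one possible, assumed strategy)."""
    ring = cycle(providers)
    return [(next(ring), chunk) for chunk in chunks]


chunks = split_into_chunks(b"abcdefghij", 4)
placement = place_round_robin(chunks, ["p0", "p1"])
```

Spreading consecutive chunks over different providers lets concurrent clients reading different parts of a BLOB hit different machines.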

Metadata (storage) providers physically store the metadata that allows identifying the chunks that make up a snapshot version. A distributed metadata management scheme is employed to enhance concurrent access to metadata.
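
To make the role of this metadata concrete, the sketch below models it as a per-version list of chunk ids and finds the chunks covering a requested byte range. The flat list, the fixed chunk size, the chunk ids, and the reuse of unchanged chunks between versions are all assumptions of this sketch; it does not describe how BlobSeer actually lays out or distributes its metadata.

```python
def chunks_for_range(chunk_ids, chunk_size, offset, size):
    """Return the ids of the chunks covering bytes [offset, offset + size)."""
    if size <= 0:
        return []
    first = offset // chunk_size
    last = (offset + size - 1) // chunk_size
    return chunk_ids[first:last + 1]


# Hypothetical metadata: snapshot version -> ordered chunk ids.
# Version 2 rewrote only the second chunk, so it differs from version 1 in one entry.
metadata = {
    1: ["c0", "c1", "c2"],
    2: ["c0", "c3", "c2"],
}

needed_v1 = chunks_for_range(metadata[1], 4, 3, 3)
needed_v2 = chunks_for_range(metadata[2], 4, 3, 3)
```

A reader first picks a snapshot version, then resolves its byte range to chunk ids, and only then contacts the data providers holding those chunks.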

The version manager is in charge of assigning new snapshot version numbers to writers and appenders and of revealing these new snapshots to readers. It does so in a way that offers the illusion of instant snapshot generation while satisfying the system's guarantees.
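
One way to picture the version manager's two duties is the sketch below: it hands out increasing version numbers, and it reveals a snapshot to readers only once that snapshot and all earlier ones have been completed. The in-order reveal rule, the class `VersionManager` and its method names are assumptions made for illustration, not a description of the actual implementation.

```python
import threading


class VersionManager:
    """Toy model (hypothetical): assigns versions and reveals them in order."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next = 1            # next version number to hand out
        self._completed = set()   # finished versions not yet revealed
        self._visible = 0         # highest version readers may see

    def assign(self):
        """Give a writer or appender its snapshot version number."""
        with self._lock:
            version = self._next
            self._next += 1
            return version

    def complete(self, version):
        """Mark a version as finished and reveal every version that is now contiguous."""
        with self._lock:
            self._completed.add(version)
            while self._visible + 1 in self._completed:
                self._visible += 1
                self._completed.remove(self._visible)

    def latest_visible(self):
        """Return the newest snapshot version that readers can access."""
        with self._lock:
            return self._visible


vm = VersionManager()
a = vm.assign()
b = vm.assign()

vm.complete(b)                      # the later writer finishes first
visible_before = vm.latest_visible()

vm.complete(a)                      # now both snapshots can be revealed
visible_after = vm.latest_visible()
```

In this model a reader never observes version `b` without version `a` also being available, even though the writers finished out of order.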

about_blobseer.txt · Last modified: 2014/12/17 09:29 (external edit)