main:faq

What is the usage of BlobSeer?

BlobSeer can work as backend storage for various applications such as Hadoop MapReduce or virtual machine image storage and deployment.

What is the context of BlobSeer?

  1. MapReduce: see the MapReduce paper; more information in French is available on the MapReduce Wikipedia page.

MapReduce is a programming model that allows a user to run a task on many machines, such as a “grep” command over a huge dataset (more than a terabyte). When a problem arrives on a node (i.e. a physical machine), the principle is to use two functions: map and reduce.

  • Map splits one big problem into many small problems and delegates them to other nodes.
  • Reduce pulls the results up from the leaf nodes to the upper nodes.
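
To make the two functions concrete, here is a minimal single-machine sketch of word counting with standard Unix tools. It only imitates the model (there is no distribution across nodes), and the function names map_words, reduce_counts and wordcount are made up for this illustration.

```shell
# "Map" step: split the input stream into one word per line.
map_words() {
    tr -s '[:space:]' '\n' | grep -v '^$'
}

# "Reduce" step: sort brings identical words together (the grouping that
# happens between map and reduce), then uniq -c collapses each group into
# a count. awk reorders the columns to "word<TAB>count".
reduce_counts() {
    sort | uniq -c | awk '{ print $2 "\t" $1 }'
}

# Full pipeline: read text on stdin, print one "word<TAB>count" line per word.
wordcount() {
    map_words | reduce_counts
}
```

For example, "cat file1 file2 | wordcount" prints the number of occurrences of each word found in the two files.
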
  1. Hadoop tutorial: Hadoop is an open source implementation of MapReduce made by Apache. This tutorial explains how a simple program, “WordCount1.0”, counts word occurrences in two files.

A good blog post complements the MapReduce tutorial.

  • First, transfer the data to the HDFS file system (“hadoop dfs -put”).
  • Then launch the MapReduce program, which is contained in a jar file (“hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input/ /usr/joe/wordcount/output/”).

The map phase cuts the input data into fixed-size chunks and delegates them to DataNodes, which run the job and return their results for the reduce phase.

  • Finally, get the result from the output directory (“hadoop dfs -cat /usr/joe/wordcount/output/part-00000”).
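
The three steps can be chained in a small helper. This is only a sketch: it assumes the hadoop command is on your PATH and reuses the jar, class name and HDFS paths quoted from the tutorial; the function name run_wordcount and the two variables are invented for this example. Note that the result is read from the job's output directory.

```shell
# HDFS directories taken from the tutorial commands.
HDFS_IN=/usr/joe/wordcount/input/
HDFS_OUT=/usr/joe/wordcount/output/

# run_wordcount LOCAL_FILE...: upload the files, run the job, print the result.
run_wordcount() {
    # 1. Copy the local input files into HDFS.
    hadoop dfs -put "$@" "$HDFS_IN" || return 1
    # 2. Launch the MapReduce program contained in the jar file.
    hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount "$HDFS_IN" "$HDFS_OUT" || return 1
    # 3. Print the first result file written to the output directory.
    hadoop dfs -cat "${HDFS_OUT}part-00000"
}
```

Usage: "run_wordcount file01 file02", where file01 and file02 are local text files.
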
  1. Grid5000: Grid'5000 is a scientific instrument supporting experiment-driven research in all areas of computer science, including high performance computing, distributed computing, networking and big data.

The architecture of Grid5000 shows that its machines are distributed across many sites in France. Note that you have to request an account to use it.

Four tutorials help you learn about Grid5000. Here is a brief description of each:

  • Getting started explains how to configure your ssh files and presents the main commands.
  • First Step explains how to reserve machines and deploy jobs on them.
  • Cluster experiment OAR2 explains how to access a Grid'5000 cluster, how to install data, and how to run jobs and visualize them.
  • Grid job management explains how to run a job on 3 machines from 3 different sites.

How to install BlobSeer?

To install BlobSeer, besides CMake, the following three libraries are necessary:

  1. Boost, a collection of free peer-reviewed portable C++ source libraries (Boost 1.40 and above)
  2. Libconfig, a simple library for manipulating structured configuration files for C/C++
  3. Berkeley DB, a transactional embedded key/value data store

A quick setup guide is given in the tutorial.

Common problems in installation

Compilation errors about Boost?

If you run into compilation problems related to Boost, please make sure that your Boost version is newer than 1.51.0.
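
One way to check the installed version is to read the BOOST_VERSION macro from Boost's version.hpp header. The sketch below assumes the header lives in /usr/include/boost/version.hpp, which depends on your distribution; the function name boost_version is made up for this example.

```shell
# boost_version [HEADER]: print the Boost version declared in version.hpp.
# BOOST_VERSION encodes major*100000 + minor*100 + patch,
# so 105100 stands for 1.51.0.
boost_version() {
    header=${1:-/usr/include/boost/version.hpp}   # assumed default location
    v=$(sed -n 's/^#define BOOST_VERSION \([0-9][0-9]*\).*/\1/p' "$header") || return 1
    [ -n "$v" ] || return 1
    echo "$((v / 100000)).$((v / 100 % 1000)).$((v % 100))"
}
```

Run "boost_version" (or pass the path of another version.hpp) and compare the printed version with 1.51.0.
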

How to deploy BlobSeer?

BlobSeer can be deployed either manually or automatically.

For a manual deployment, build a configuration file from the template configuration file or the sample configuration file provided above, then run each of the required processes:

$INSTALL_DIR/bin/vmanager localhost-config-file.cfg
$INSTALL_DIR/bin/pmanager localhost-config-file.cfg
$INSTALL_DIR/bin/provider localhost-config-file.cfg
$INSTALL_DIR/bin/sdht localhost-config-file.cfg
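
If you start these processes often, the four commands can be wrapped in a small helper. This is a sketch rather than one of the provided deployment scripts: the function name start_blobseer is invented here, and it simply starts each process in the background, in the order listed above, with the same configuration file.

```shell
# start_blobseer INSTALL_DIR CONFIG_FILE
# Start the four processes in the background and print "name pid" for each.
start_blobseer() {
    install_dir=$1
    config=$2
    for proc in vmanager pmanager provider sdht; do
        bin="$install_dir/bin/$proc"
        # Stop early if one of the executables is missing.
        if [ ! -x "$bin" ]; then
            echo "missing executable: $bin" >&2
            return 1
        fi
        "$bin" "$config" &
        echo "$proc $!"
    done
}
```

Usage: start_blobseer "$INSTALL_DIR" localhost-config-file.cfg. The printed PIDs can later be passed to kill to stop the processes.
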

For automatic deployment, we provide several scripts to deploy BlobSeer on a local machine, on the Grid5000 platform, or on a general cluster. More information about how to use the scripts can be found in the following tutorial.

Problem about Locale Names

On certain operating systems, such as Fedora and Red Hat, you may run into “locale name” problems when launching data providers. Boost may throw an exception such as:

terminate called after throwing an instance of 'std::runtime_error'
  what():  locale::facet::_S_create_c_locale name not valid
  

To fix this, set LC_CTYPE to an empty value: export LC_CTYPE=
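
If you prefer not to change the locale of your whole shell session, the variable can be cleared for a single command only. The helper below is a sketch; the name run_with_empty_ctype is made up for this example. Keep in mind that LC_ALL, when set, takes precedence over LC_CTYPE, so this alone may not be enough in that case.

```shell
# run_with_empty_ctype CMD [ARGS...]
# Run one command with LC_CTYPE set to the empty string, leaving the
# environment of the calling shell untouched.
run_with_empty_ctype() {
    env LC_CTYPE= "$@"
}
```

Usage: run_with_empty_ctype "$INSTALL_DIR/bin/provider" localhost-config-file.cfg
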

main/faq.txt · Last modified: 2014/12/17 09:29 (external edit)