
BlobSeer File System - Hadoop Backend

As a large-scale distributed storage service, BlobSeer can serve as the storage backend of various data-intensive cloud applications. The BlobSeer File System (BSFS), an intermediate layer between BlobSeer and Hadoop applications, enables BlobSeer to offer its advanced storage services to the MapReduce framework and thus improve the efficiency of Hadoop. It was designed and implemented by Diana Moise during her PhD thesis (2008-2011) in the KerData team, INRIA Rennes.

How to Use on a Local Machine

The current BSFS is compatible with BlobSeer-1.2.1 and Hadoop-1.2.1. Before continuing with this tutorial, make sure that BlobSeer-1.2.1 is correctly installed on the machine that will run the Hadoop/BSFS experiments.
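A quick sanity check is to verify that the BlobSeer libraries and Java bindings (both used later in this tutorial) are present:

$ ls <blobseer-dir>/lib/libblobseer*
$ ls <blobseer-dir>/bindings/java/libblobseer-java.so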

Step 1: Download and Compile

Download the source code of BSFS from the INRIA Gforge (BSFS-1.2.2), as well as Hadoop-1.2.1. Copy the folder BSFS-1.2.2 extracted from BSFS-1.2.2.zip to the directory <hadoop-dir>/src/core/org/apache/hadoop/fs/ and rename it to bsfs. This part of the program will work as the file system client.

(remove the previous copy if necessary: rm -r <hadoop-dir>/src/core/org/apache/hadoop/fs/bsfs)
$ cp -r BSFS-1.2.2 <hadoop-dir>/src/core/org/apache/hadoop/fs/bsfs

Copy the Java interface to BlobSeer into the Hadoop source code. The Java interface can be found in <blobseer-dir>/bindings/java/blobseer.

(remove the previous copy if necessary: rm -r <hadoop-dir>/src/core/blobseer)
$ cp -r <blobseer-dir>/bindings/java/blobseer <hadoop-dir>/src/core/blobseer

Two small modifications are necessary. The first one is in the general build file of Hadoop. Open <hadoop-dir>/build.xml and apply the following patch:

@@ -690,10 +690,9 @@
       <env key="HADOOP_NATIVE_SRCDIR" value="${native.src.dir}"/>
   </exec> 
-    <exec dir="${build.native}" executable="sh" failonerror="true">
-      <arg line="${build.native}/libtool --mode=install cp ${build.native}/libhadoop.la ${build.native}/lib"/>
+    <exec dir="${build.native}" executable="${build.native}/libtool" failonerror="true">
+      <arg line="--mode=install cp ${build.native}/libhadoop.la ${build.native}/lib"/>
   </exec>
 </target>

The second modification is in the file <hadoop-dir>/src/core/org/apache/hadoop/fs/permission/FsPermission.java. Since BSFS uses the Serializable interface to transfer client requests, the class FsPermission, which stores file permission information, must also be serializable.

@@ -20,6 +20,7 @@
+ import java.io.Serializable;
@@ -31,8 +32,8 @@
- public class FsPermission implements Writable{
- private static final Log LOG = LogFactory.getLog(FsPermission.class);
+ public class FsPermission implements Writable, Serializable{
+ //private static final Log LOG = LogFactory.getLog(FsPermission.class);
@@ -189,9 +190,9 @@
- LOG.warn(DEPRECATED_UMASK_LABEL + " configuration key is deprecated. " +
-     "Convert to " + UMASK_LABEL + ", using octal or symbolic umask " +
-     "specifications.");
+ /*LOG.warn(DEPRECATED_UMASK_LABEL + " configuration key is deprecated. " +
+     "Convert to " + UMASK_LABEL + ", using octal or symbolic umask " +
+     "specifications.");*/

These modifications can be applied using the patch file BSFS-1.2.1.patch in the folder BSFS-1.2.1. Apply it from the parent directory of <hadoop-dir>.
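For example, assuming the paths inside the patch are relative to the parent of <hadoop-dir> (verify first with --dry-run), the invocation would look like:

$ cd <hadoop-dir>/..
$ patch -p0 --dry-run < BSFS-1.2.1/BSFS-1.2.1.patch
$ patch -p0 < BSFS-1.2.1/BSFS-1.2.1.patch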

Before compiling, please make sure that the following packages and software are installed:

  • ant
  • zlib-1.2.5
  • java-6-sun (or later)
  • libtool
  • autoconf
  • automake

Then, export the environment variables:

$ export LDFLAGS=-Wl,--no-as-needed
$ export JAVA_HOME=<java-dir>
$ export HADOOP_HOME=<hadoop-dir>
$ export LD_LIBRARY_PATH="$HADOOP_HOME/blobseer-fsmeta/lib:$LD_LIBRARY_PATH"
$ export ANT_OPTS="-Dhttp.proxyHost=proxy -Dhttp.proxyPort=3128"   (only needed if you are behind an HTTP proxy)

Now we are able to compile Hadoop with the BSFS client:

$ ant && ant compile-native

After that, each time the Hadoop code is modified, only one compile command is necessary:

$ ant compile-core

While the previous compilation integrates Hadoop with the file system client, the following one builds the standalone namespace manager.

Copy the NManager source code from BSFS-1.2.2/NManager to <hadoop-dir>/blobseer-fsmeta, set the environment variables in its build.xml (HADOOP_HOME=<hadoop-dir>, SERVER_HOME=$HADOOP_HOME/blobseer-fsmeta), and build it.

$ cp -r BSFS-1.2.2/NManager <hadoop-dir>/blobseer-fsmeta
$ cd <hadoop-dir>/blobseer-fsmeta
$ ant build
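For reference, the property definitions to change in the NManager build.xml might look like the sketch below; the exact element form depends on the build.xml shipped with NManager, so treat this as an illustration only:

<!-- assumed Ant property syntax; adjust to the actual build.xml -->
<property name="HADOOP_HOME" value="<hadoop-dir>"/>
<property name="SERVER_HOME" value="${HADOOP_HOME}/blobseer-fsmeta"/>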

Copy the BlobSeer libraries into both the file system client and the namespace manager:

$ mkdir <hadoop-dir>/blobseer-fsmeta/lib
$ cp <blobseer-dir>/bindings/java/libblobseer-java.so <hadoop-dir>/blobseer-fsmeta/lib
$ cp <blobseer-dir>/lib/libblobseer* <hadoop-dir>/blobseer-fsmeta/lib
$ cp <hadoop-dir>/blobseer-fsmeta/lib/libb* <hadoop-dir>/build/native/Linux-<arch>/lib
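You can check that the BlobSeer libraries are now in place for the native build:

$ ls <hadoop-dir>/build/native/Linux-<arch>/lib/libb*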

At this point, the compilation of Hadoop plus BSFS is finished; we can rename the folder to distinguish it from a vanilla Hadoop build:

$ mv hadoop-1.2.1 hadoop-bsfs-1.2.1

For comparison, we can also build a vanilla hadoop-1.2.1:

$ tar -xzvf hadoop-1.2.1.tar.gz
$ cd hadoop-1.2.1
$ ant && ant compile-native

Step 2: Configuration

As with the original Hadoop MapReduce framework, some configuration files must be filled in. First of all, do not forget to export JAVA_HOME in <hadoop-dir>/conf/hadoop-env.sh. Then, we need to configure core-site.xml and mapred-site.xml: just replace the ones in <hadoop-dir>/conf/ with those provided in BSFS-1.2.2.

$ cp BSFS-1.2.2/core-site.xml <hadoop-dir>/conf
$ cp BSFS-1.2.2/mapred-site.xml <hadoop-dir>/conf

The value given in “fs.bsfs.page.size” determines the “pageSize” passed to BlobSeer when a blob is created (createBlob function).
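For reference, a core-site.xml for a local deployment might look like the sketch below. The key “fs.default.name” is the standard Hadoop 1.x property, and the two values match those used later in this tutorial; the property name “fs.bsfs.config” and the page size value are hypothetical placeholders, so rely on the file shipped with BSFS:

<configuration>
  <!-- hypothetical key: path to the BlobSeer configuration file -->
  <property>
    <name>fs.bsfs.config</name>
    <value>/tmp/blobseer.cfg</value>
  </property>
  <!-- standard Hadoop 1.x key: default file system, pointing to the BSFS namespace manager -->
  <property>
    <name>fs.default.name</name>
    <value>bsfs://localhost:9000</value>
  </property>
  <!-- BSFS page size (example value, see the note above) -->
  <property>
    <name>fs.bsfs.page.size</name>
    <value>67108864</value>
  </property>
</configuration>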

Step 3: Deploy and Test

We now arrive at the stage where we deploy Hadoop plus BlobSeer and test it with simple MapReduce applications. Of course, we should deploy BlobSeer first; then we can launch BSFS on top of it. Enter the blobseer-fsmeta directory and type the following command:

$ java -cp build/:../build/classes/:lib/ NManager 9000 <BlobSeer configuration file> &

Here, <BlobSeer configuration file> is /tmp/blobseer.cfg, which is created by running <blobseer-dir>/scripts/local-deploy.sh.
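As a quick check that the namespace manager is up, you can look for its Java process:

$ jps | grep NManager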

After that, BlobSeer can be accessed by Hadoop through BSFS for reading and writing data. For example, we can write the CHANGES.txt file into BlobSeer.

$ <hadoop-dir>/bin/hadoop fs -copyFromLocal <hadoop-dir>/CHANGES.txt test.txt

Then, check the state of the file system.

$ <hadoop-dir>/bin/hadoop fs -ls

If BSFS works correctly, the following content should be displayed in the terminal.

Found 1 items
-rwxr-xr-x   1 user supergroup       446999 2013-04-04 17:54 /test.txt

To start the JobTracker and the TaskTrackers, we need to manually create the system directory used to store logs and intermediate files of MapReduce jobs.

$ <hadoop-dir>/bin/hadoop fs -mkdir /tmp/hadoop-$USER/mapred/system

Then, launch the JobTracker and the TaskTrackers.

$ <hadoop-dir>/bin/start-mapred.sh

To check whether all components were successfully launched, just type 'jps'; it lists all running Java processes.

$ jps
15080 JobTracker
15295 TaskTracker
16364 Jps
14958 NManager

If the three processes JobTracker, TaskTracker, and NManager are found, Hadoop built on BlobSeer is ready to execute MapReduce tasks. So far, Hadoop plus BlobSeer can tackle two simple jobs, 'wordcount' and 'sort'. The following commands start these two example jobs.

  • 'wordcount':
$ <hadoop-dir>/bin/hadoop jar hadoop*examples*.jar wordcount test.txt /output-wordcount
  • 'sort':
$ <hadoop-dir>/bin/hadoop jar hadoop*examples*.jar sort -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat -outFormat org.apache.hadoop.mapred.TextOutputFormat -outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text test.txt output-sort

The results should be stored in the corresponding output directory in BSFS. To check the results, use the '-cat' command.

$ <hadoop-dir>/bin/hadoop fs -cat <output-wordcount>/part-r-00000

How to Use on Grid5000

Since Grid5000 does not host the libraries necessary for the compilation, the Hadoop project should be built locally and then copied to Grid5000. Please note that the local compilation environment should match that of Grid5000: Linux-amd64-64.
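As a quick check, you can compare the kernel and architecture of the local build machine with those of the Grid5000 nodes:

$ uname -sm    (should print 'Linux x86_64', which corresponds to Hadoop's Linux-amd64-64 native build directory)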

Copy all the compiled programs, including the BSFS client and the namespace manager, to the front-end of Grid5000. Overwrite the BlobSeer libraries with those compiled on Grid5000; otherwise, BSFS will not be able to find the BlobSeer deployment running on Grid5000:

$ cp <blobseer-dir-on-g5k>/bindings/java/lib/libblobseer-java.so <hadoop-dir>/blobseer-fsmeta/lib
$ cp <blobseer-dir-on-g5k>/lib/libblobseer* <hadoop-dir>/blobseer-fsmeta/lib
$ cp <hadoop-dir>/blobseer-fsmeta/lib/libb* <hadoop-dir>/build/native/Linux-<arch>/lib

Thereafter, we need to slightly modify the configuration file 'core-site.xml' in '<hadoop-dir>/conf':

- line 20: <value>/tmp/blobseer.cfg</value>
+ line 20: <value><path to your BlobSeer configuration file></value>
- line 26: <value>bsfs://localhost:9000</value>
+ line 26: <value>bsfs://<namespace manager IP address>:<port></value>
 

Do not forget to put the name of the master node in the file '<hadoop-dir>/conf/masters' and the slave nodes in the file '<hadoop-dir>/conf/slaves':

griffon-2.nancy.grid5000.fr
griffon-30.nancy.grid5000.fr
griffon-67.nancy.grid5000.fr
...

Of course, we need to deploy BlobSeer before launching BSFS; instructions for deploying BlobSeer on Grid5000 are given in the dedicated tutorial.

At this point, we are ready to launch BSFS with Hadoop on Grid5000: just repeat Step 3 of the local deployment. If you do not want to get your hands dirty, scripts are also provided to facilitate the deployment process. They are in the folder 'hadoop-bsfs-deploy-script', which is also included in 'BSFS-1.2.1.zip'.

The main script is 'hb-single-clustest.sh'. To run it, two parameters are mandatory: the location of the file containing the environment variable configuration, and the name of the job on Grid5000. Details on the options can be displayed with the '-h' option.

Currently, we put the essential environment variables in the file 'env.sh'. The variables are:

export SCRIPT_HOME=<location of hb-single-clustest.sh>
export BLOBSEER_HOME=<blobseer-dir>
export HADOOP_HOME=<hadoop-dir>
export LD_LIBRARY_PATH=$HADOOP_HOME/blobseer-fsmeta/lib:$LD_LIBRARY_PATH
export BSFS_SERVER_HOME=$HADOOP_HOME/blobseer-fsmeta

Now, just simply type:

$ ./hb-single-clustest.sh -e env.sh -n <g5k job name>

Hadoop on BSFS will then be deployed automatically.

To clean up the deployment, use 'hb-clean.sh'.

$ ./hb-clean.sh -e env.sh

Furthermore, we also provide a script that deploys Hadoop with HDFS on Grid5000. It is in the folder 'hadoop-deploy-script' in the same archive 'BSFS-1.2.1.zip' and is very easy to use. The folder contains two scripts: one for deployment, the other for cleaning up the deployment. For deployment, simply run 'deploy-hadoop.sh' with two parameters: the Hadoop home folder and the Grid5000 job name.

$ ./deploy-hadoop.sh -o <hadoop-dir> -n <job name>

For more information about the options, please use the '-h/--help' option.

To clean up the deployment, run 'clean-hadoop.sh' with the parameter indicating the Hadoop home folder.

$ ./clean-hadoop.sh -o <hadoop-dir>