As a large-scale distributed storage service, BlobSeer can serve as the storage backend for various data-intensive cloud applications. The BlobSeer File System (BSFS), an intermediate layer between BlobSeer and Hadoop applications, lets the MapReduce framework use BlobSeer's advanced storage features in order to improve the efficiency of Hadoop. It was designed and implemented by Diana Moise during her thesis (2008 to 2011) in the KerData team, INRIA, Rennes.
The current BSFS is compatible with BlobSeer-1.2.1 and Hadoop-1.2.1. Before continuing with this tutorial, make sure that BlobSeer-1.2.1 is correctly installed on the machine that will run the Hadoop/BSFS experiments.
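As a quick sanity check, you can verify that the BlobSeer libraries and the Java binding exist (the paths follow the layout used later in this tutorial):
$ ls <blobseer-dir>/lib/libblobseer*
$ ls <blobseer-dir>/bindings/java/libblobseer-java.so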
Download the source code of BSFS (BSFS-1.2.2) from the INRIA Gforge, as well as Hadoop-1.2.1. Copy the folder BSFS-1.2.2, extracted from BSFS-1.2.2.zip, to the directory <hadoop-dir>/src/core/org/apache/hadoop/fs/ and rename it to bsfs. This part of the program will work as the file system client.
$ rm -r <hadoop-dir>/src/core/org/apache/hadoop/fs/bsfs   # remove a previous copy if necessary
$ cp -r BSFS-1.2.2 <hadoop-dir>/src/core/org/apache/hadoop/fs/bsfs
Copy the Java interface to BlobSeer into the Hadoop source code. The Java interface can be found in <blobseer-dir>/bindings/java/blobseer.
$ rm -r <hadoop-dir>/src/core/blobseer   # remove a previous copy if necessary
$ cp -r <blobseer-dir>/bindings/java/blobseer <hadoop-dir>/src/core/blobseer
Two small modifications are necessary. The first one is in the general build file of Hadoop. Open <hadoop-dir>/build.xml and apply the following patch:
@@ -690,10 +690,9 @@
       <env key="HADOOP_NATIVE_SRCDIR" value="${native.src.dir}"/>
     </exec>
-    <exec dir="${build.native}" executable="sh" failonerror="true">
-      <arg line="${build.native}/libtool --mode=install cp ${build.native}/libhadoop.la ${build.native}/lib"/>
+    <exec dir="${build.native}" executable="${build.native}/libtool" failonerror="true">
+      <arg line="--mode=install cp ${build.native}/libhadoop.la ${build.native}/lib"/>
     </exec>
   </target>
The second modification is in the file <hadoop-dir>/src/core/org/apache/hadoop/fs/permission/FsPermission.java. Since BSFS uses the Serializable interface to transfer client requests, the class FsPermission, which stores the file permission information, must also be serializable.
@@ -20,6 +20,7 @@
+import java.io.Serializable;

@@ -31,8 +32,8 @@
-public class FsPermission implements Writable{
-  private static final Log LOG = LogFactory.getLog(FsPermission.class);
+public class FsPermission implements Writable, Serializable{
+  //private static final Log LOG = LogFactory.getLog(FsPermission.class);

@@ -189,9 +190,9 @@
-    LOG.warn(DEPRECATED_UMASK_LABEL + " configuration key is deprecated. " +
-        "Convert to " + UMASK_LABEL + ", using octal or symbolic umask " +
-        "specifications.");
+    /*LOG.warn(DEPRECATED_UMASK_LABEL + " configuration key is deprecated. " +
+        "Convert to " + UMASK_LABEL + ", using octal or symbolic umask " +
+        "specifications.");*/
These modifications can be made using the patch file BSFS-1.2.1.patch in the folder BSFS-1.2.1. Apply it from the parent directory of <hadoop-dir>.
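For example, assuming the patch was generated relative to the parent directory of <hadoop-dir> (adjust the -p strip level if the hunks do not apply cleanly):
$ cd <parent-of-hadoop-dir>
$ patch -p0 < <path-to>/BSFS-1.2.1/BSFS-1.2.1.patch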
Before the compilation, please make sure that the required packages and software are installed; the build below relies at minimum on a JDK, Ant, a C/C++ toolchain, and libtool.
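For instance, you can verify that the toolchain is available with the following version checks:
$ javac -version
$ ant -version
$ gcc --version
$ libtool --version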
Then, export the environment variables:
$ export LDFLAGS=-Wl,--no-as-needed
$ export JAVA_HOME=<java-dir>
$ export HADOOP_HOME=<hadoop-dir>
$ export LD_LIBRARY_PATH="$HADOOP_HOME/blobseer-fsmeta/lib:$LD_LIBRARY_PATH"
$ export ANT_OPTS="-Dhttp.proxyHost=proxy -Dhttp.proxyPort=3128"   # only needed behind an HTTP proxy
Now we are able to compile Hadoop with BSFS client:
$ ant && ant compile-native
After that, each time the Hadoop code is modified, only one compile command is necessary:
$ ant compile-core
While the last compilation integrates Hadoop with the file system client, the following compilation builds the standalone namespace manager.
Copy the NManager source code in BSFS-1.2.2/NManager to <hadoop-dir>/blobseer-fsmeta, change the environment variables in its build.xml (HADOOP_HOME=<hadoop-dir>, SERVER_HOME=$HADOOP_HOME/blobseer-fsmeta), and build it.
$ cp -r BSFS-1.2.2/NManager <hadoop-dir>/blobseer-fsmeta
$ cd <hadoop-dir>/blobseer-fsmeta
$ ant build
Copy the BlobSeer libraries into both the file system client and the namespace manager.
$ mkdir <hadoop-dir>/blobseer-fsmeta/lib
$ cp <blobseer-dir>/bindings/java/libblobseer-java.so <hadoop-dir>/blobseer-fsmeta/lib
$ cp <blobseer-dir>/lib/libblobseer* <hadoop-dir>/blobseer-fsmeta/lib
$ cp <hadoop-dir>/blobseer-fsmeta/lib/libb* <hadoop-dir>/build/native/Linux-<arch>/lib
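A quick check that the libraries landed in both places (the exact file names depend on your BlobSeer build):
$ ls <hadoop-dir>/blobseer-fsmeta/lib
$ ls <hadoop-dir>/build/native/Linux-<arch>/lib/libb*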
The compilation of Hadoop plus BSFS is now finished; we can rename the folder:
$ mv hadoop-1.2.1 hadoop-bsfs-1.2.1
We can also build a plain hadoop-1.2.1 the same way (used later for the HDFS deployment):
$ tar -xzvf hadoop-1.2.1.tar.gz
$ cd hadoop-1.2.1
$ ant && ant compile-native
As with the original Hadoop MapReduce framework, some configuration files must be filled in. First of all, do not forget to export JAVA_HOME in <hadoop-dir>/conf/hadoop-env.sh. Then, we need to configure core-site.xml and mapred-site.xml: simply replace the core-site.xml and mapred-site.xml in <hadoop-dir>/conf/ with the ones provided in BSFS-1.2.2.
$ cp BSFS-1.2.2/core-site.xml <hadoop-dir>/conf
$ cp BSFS-1.2.2/mapred-site.xml <hadoop-dir>/conf
The value given in “fs.bsfs.page.size” determines the “pageSize” that BlobSeer uses when creating a blob (the createBlob function).
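To inspect the value currently set (a simple text search; the property is a standard Hadoop name/value pair in core-site.xml):
$ grep -A 1 'fs.bsfs.page.size' <hadoop-dir>/conf/core-site.xml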
We now arrive at the stage of deploying Hadoop plus BlobSeer and testing it with simple MapReduce applications. Of course, we should deploy BlobSeer first. Then, we can launch BSFS on top of BlobSeer. Enter the blobseer-fsmeta directory and type the following command.
$ java -cp build/:../build/classes/:lib/ NManager 9000 <BlobSeer configuration file> &
Here, <BlobSeer configuration file> is /tmp/blobseer.cfg, which is created by running <blobseer-dir>/scripts/local-deploy.sh.
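Before going on, you can check that the namespace manager is actually running (jps lists the running Java processes, as used again below):
$ jps | grep NManager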
After that, BlobSeer can be accessed by Hadoop through BSFS for reading and writing data. For example, we can write the CHANGES.txt file into BlobSeer.
$ <hadoop-dir>/bin/hadoop fs -copyFromLocal <hadoop-dir>/CHANGES.txt test.txt
Then, check the state of the file system.
$ <hadoop-dir>/bin/hadoop fs -ls
If BSFS works correctly, the following content should be displayed in the terminal.
Found 1 items
-rwxr-xr-x 1 user supergroup 446999 2013-04-04 17:54 /test.txt
To start the JobTracker and the TaskTrackers, we first need to manually create the system directories that store the logs and intermediate files of MapReduce jobs.
$ <hadoop-dir>/bin/hadoop fs -mkdir /tmp/hadoop-$USER/mapred/system
Then, launch JobTracker and TaskTrackers.
$ <hadoop-dir>/bin/start-mapred.sh
To check whether all the components were successfully launched, just type 'jps'. The result should list all the running Java processes.
15080 JobTracker
15295 TaskTracker
16364 Jps
14958 NManager
If the three processes (JobTracker, TaskTracker, and NManager) are found, the Hadoop built on BlobSeer is ready to execute MapReduce tasks. We can now test it with two simple example jobs, 'wordcount' and 'sort'. The following are the commands to start them.
$ <hadoop-dir>/bin/hadoop jar hadoop*examples*.jar wordcount test.txt /output-wordcount
$ <hadoop-dir>/bin/hadoop jar hadoop*examples*.jar sort -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat -outFormat org.apache.hadoop.mapred.TextOutputFormat -outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text test.txt output-sort
The results should be stored in the corresponding output directory in BSFS. To check the results, use the '-cat' command.
$ <hadoop-dir>/bin/hadoop fs -cat /output-wordcount/part-r-00000
Since Grid5000 does not host the libraries necessary for the compilation, the Hadoop project should be built locally and then copied to Grid5000. Please note that the local compilation environment should match Grid5000: Linux-amd64-64.
Copy all the compiled programs, including the BSFS client and the namespace manager, to the front-end of Grid5000. Overwrite the BlobSeer libraries with the ones compiled on Grid5000; otherwise, BSFS will not be able to find the BlobSeer deployment running on Grid5000:
$ cp <blobseer-dir-on-g5k>/bindings/java/lib/libblobseer-java.so <hadoop-dir>/blobseer-fsmeta/lib
$ cp <blobseer-dir-on-g5k>/lib/libblobseer* <hadoop-dir>/blobseer-fsmeta/lib
$ cp <hadoop-dir>/blobseer-fsmeta/lib/libb* <hadoop-dir>/build/native/Linux-<arch>/lib
Thereafter, we need to slightly modify the configuration file 'core-site.xml' in '<hadoop-dir>/conf':
- line 20: <value>/tmp/blobseer.cfg</value>
+ line 20: <value><your BlobSeer configuration file></value>
- line 26: <value>bsfs://localhost:9000</value>
+ line 26: <value>bsfs://<namespace manager IP>:<port></value>
Do not forget to put the name of the master node in the file '<hadoop-dir>/conf/masters' and the slave nodes in the file '<hadoop-dir>/conf/slaves':
griffon-2.nancy.grid5000.fr
griffon-30.nancy.grid5000.fr
griffon-67.nancy.grid5000.fr
...
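On Grid5000, the nodes of the current OAR job are listed in $OAR_NODEFILE (one line per core), so, assuming a single reservation, these two files can be generated as follows:
$ uniq $OAR_NODEFILE | head -n 1 > <hadoop-dir>/conf/masters
$ uniq $OAR_NODEFILE > <hadoop-dir>/conf/slaves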
Of course we need to deploy BlobSeer before we launch BSFS. How to deploy BlobSeer on Grid5000? The answer is given here.
We are now ready to launch BSFS with Hadoop on Grid5000: just repeat Step 3 of the local deployment, and Hadoop with BSFS will run on Grid5000. If you do not want to get your hands dirty, some scripts are also provided to facilitate the deployment process. These scripts are in the folder 'hadoop-bsfs-deploy-script', which is also included in the 'BSFS-1.2.1.zip' archive.
The main script is 'hb-single-clustest.sh'. To run it, two input parameters are obligatory: the location of the file that contains the environment variable configuration, and the name of the job on Grid5000. The details of the options can be shown using the '-h' option.
Currently, we put the essential environment variables in the file 'env.sh'. The variables are:
export SCRIPT_HOME=<location of hb-single-clustest.sh>
export BLOBSEER_HOME=<blobseer-dir>
export HADOOP_HOME=<hadoop-dir>
export LD_LIBRARY_PATH=$HADOOP_HOME/blobseer-fsmeta/lib:$LD_LIBRARY_PATH
export BSFS_SERVER_HOME=$HADOOP_HOME/blobseer-fsmeta
Now, just simply type:
$ ./hb-single-clustest.sh -e env.sh -n <g5k job name>
Hadoop on BSFS will then be deployed automatically.
To clean up the deployment, use 'hb-clean.sh'.
$ ./hb-clean.sh -e env.sh
Furthermore, we also provide a script that deploys Hadoop with HDFS on Grid5000. It is in the folder 'hadoop-deploy-script' in the same archive, 'BSFS-1.2.1.zip', and it is very easy to use. There are two scripts in the folder: one for deployment, the other for cleaning up. For deployment, simply run 'deploy-hadoop.sh' with two parameters: the Hadoop home folder and the Grid5000 job name.
$ ./deploy-hadoop.sh -o <hadoop-dir> -n <job name>
For more information about the options, please use the '-h/--help' option.
To clean up the deployment, run 'clean-hadoop.sh' with the parameter indicating the Hadoop home folder.
$ ./clean-hadoop.sh -o <hadoop-dir>