Big Data Analytics
Apache Spark, Apache Flink, and Apache Hadoop are frameworks for processing and integrating Big Data. These frameworks are also offered as software modules. You can check module versions and availability with the commands
marie@login$ module avail Spark
marie@login$ module avail Flink
Prerequisites: To work with the frameworks, you need access to ZIH systems and basic knowledge about data analysis and the batch system Slurm.
The usage of Big Data frameworks differs from that of other modules due to their master-worker approach. That means that additional steps are required before an application can be started. In the following, we assume that a Spark application should be started and give alternative commands for Flink where applicable.
The steps are:
- Load the Spark software module
- Configure the Spark cluster
- Start a Spark cluster
- Start the Spark application
Apache Spark can be used in interactive and batch jobs as well as via Jupyter notebooks. All three ways are outlined in the following.
Interactive Jobs
Default Configuration
The Spark and Flink modules are available in the power environment. Thus, Spark and Flink can be executed on the Power CPU architecture.
Let us assume that two nodes should be used for the computation. Use an srun command similar to the following to start an interactive session that allocates two nodes with 60000 MB main memory exclusively for one hour:
marie@login.power$ srun --nodes=2 --mem=60000M --exclusive --time=01:00:00 --pty bash -l
Once you have the shell, load the desired Big Data framework using the command
marie@compute$ module load Spark
marie@compute$ module load Flink
Before the application can be started, the cluster with the allocated nodes needs to be set up. To do this, configure the cluster first using the configuration template at $SPARK_HOME/conf for Spark or $FLINK_ROOT_DIR/conf for Flink:
marie@compute$ source framework-configure.sh spark $SPARK_HOME/conf
marie@compute$ source framework-configure.sh flink $FLINK_ROOT_DIR/conf
This places the configuration in a directory called cluster-conf-<JOB_ID> in your home directory, where <JOB_ID> stands for the ID of the Slurm job. After that, you can start the cluster in the usual way:
marie@compute$ start-all.sh
marie@compute$ start-cluster.sh
The necessary background processes should now be set up, and you can start your application, e.g.:
marie@compute$ spark-submit --class org.apache.spark.examples.SparkPi \
$SPARK_HOME/examples/jars/spark-examples_2.12-3.0.1.jar 1000
marie@compute$ flink run $FLINK_ROOT_DIR/examples/batch/KMeans.jar
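If you want to check that the cluster daemons have been started, you can, for example, list the running Java processes with the jps tool from the JDK. This is only a suggestion; the process names depend on the framework and version (for a Spark standalone cluster they typically include Master and Worker):
marie@compute$ jps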
Warning
Do not delete the directory cluster-conf-<JOB_ID> while the job is still running. This leads to errors.
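When you are finished, stop the cluster before your allocation ends. The stop scripts stop-all.sh (Spark) and stop-cluster.sh (Flink) are the counterparts of the start scripts used above:
marie@compute$ stop-all.sh
marie@compute$ stop-cluster.sh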
Custom Configuration
The script framework-configure.sh is used to derive a configuration from a template. It takes two parameters:
- The framework to set up (parameter spark for Spark, flink for Flink, and hadoop for Hadoop)
- A configuration template
Thus, you can modify the configuration by replacing the default configuration template with a customized one. This way, your custom configuration template is reusable for different jobs. You can start with a copy of the default configuration ahead of your interactive session:
marie@login.power$ cp -r $SPARK_HOME/conf my-config-template
marie@login.power$ cp -r $FLINK_ROOT_DIR/conf my-config-template
After you have changed my-config-template, you can use your new template in an interactive job with:
marie@compute$ source framework-configure.sh spark my-config-template
marie@compute$ source framework-configure.sh flink my-config-template
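For example, you could adjust Spark properties in the copied template before sourcing it. The following one-liner is only a sketch: the file name spark-defaults.conf and the property value are assumptions and depend on the configuration template shipped with the module:
marie@login.power$ echo "spark.executor.memory 8g" >> my-config-template/spark-defaults.conf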
Using Hadoop Distributed Filesystem (HDFS)
If you want to use Spark and HDFS together (or in general more than one framework), a scheme similar to the following can be used:
marie@compute$ module load Hadoop
marie@compute$ module load Spark
marie@compute$ source framework-configure.sh hadoop $HADOOP_ROOT_DIR/etc/hadoop
marie@compute$ source framework-configure.sh spark $SPARK_HOME/conf
marie@compute$ start-dfs.sh
marie@compute$ start-all.sh
marie@compute$ module load Hadoop
marie@compute$ module load Flink
marie@compute$ source framework-configure.sh hadoop $HADOOP_ROOT_DIR/etc/hadoop
marie@compute$ source framework-configure.sh flink $FLINK_ROOT_DIR/conf
marie@compute$ start-dfs.sh
marie@compute$ start-cluster.sh
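Once HDFS is running, you can, for instance, copy input data into it and list it from within your job. The directory and file names below are placeholders for this sketch:
marie@compute$ hdfs dfs -mkdir -p /user/marie
marie@compute$ hdfs dfs -put my-input.csv /user/marie/
marie@compute$ hdfs dfs -ls /user/marie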
Batch Jobs
Using srun directly on the shell blocks the shell and launches an interactive job. Apart from short test runs, it is recommended to launch your jobs in the background using batch jobs. For that, you can conveniently put the parameters directly into the job file and submit it via sbatch [options] <job file>.
Please use a batch job with a configuration similar to the examples below. The first job file is for Spark, the second one for Flink:
example-starting-script.sbatch
#!/bin/bash -l
#SBATCH --time=01:00:00
#SBATCH --nodes=2
#SBATCH --exclusive
#SBATCH --mem=60000M
#SBATCH --job-name="example-spark"
module load Spark/3.0.1-Hadoop-2.7-Java-1.8-Python-3.7.4-GCCcore-8.3.0
function myExitHandler () {
stop-all.sh
}
#configuration
. framework-configure.sh spark $SPARK_HOME/conf
#register cleanup hook in case something goes wrong
trap myExitHandler EXIT
start-all.sh
spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.1.jar 1000
stop-all.sh
exit 0
#!/bin/bash -l
#SBATCH --time=01:00:00
#SBATCH --nodes=2
#SBATCH --exclusive
#SBATCH --mem=60000M
#SBATCH --job-name="example-flink"
module load Flink/1.12.3-Java-1.8.0_161-OpenJDK-Python-3.7.4-GCCcore-8.3.0
function myExitHandler () {
stop-cluster.sh
}
#configuration
. framework-configure.sh flink $FLINK_ROOT_DIR/conf
#register cleanup hook in case something goes wrong
trap myExitHandler EXIT
#start the cluster
start-cluster.sh
#run your application
flink run $FLINK_ROOT_DIR/examples/batch/KMeans.jar
#stop the cluster
stop-cluster.sh
exit 0
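Assuming the Spark script above is saved as example-starting-script.sbatch, you can submit it and check its state with the usual Slurm commands:
marie@login.power$ sbatch example-starting-script.sbatch
marie@login.power$ squeue -u marie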
Jupyter Notebook
You can run Jupyter notebooks with Spark and Flink on the ZIH systems in a similar way as described on the JupyterHub page.
Spawning a Notebook
Go to https://jupyterhub.hpc.tu-dresden.de. In the tab "Advanced", go to the field "Preload modules" and select one of the following Spark or Flink modules:
Spark/3.0.1-Hadoop-2.7-Java-1.8-Python-3.7.4-GCCcore-8.3.0
Flink/1.12.3-Java-1.8.0_161-OpenJDK-Python-3.7.4-GCCcore-8.3.0
When your Jupyter instance is started, you can set up Spark/Flink. Since the setup in the notebook requires more steps than in an interactive session, we have created example notebooks that you can use as a starting point for convenience: SparkExample.ipynb, FlinkExample.ipynb
Warning
The notebooks only work with the Spark or Flink modules mentioned above. When using other Spark/Flink modules, it is possible that you have to perform additional or different steps in order to get Spark/Flink running.
Note
You could work with simple examples in your home directory, but, according to the storage concept, please use workspaces for your study and work projects. For this reason, you have to use the advanced options of JupyterHub and put "/" in the "Workspace scope" field.
FAQ
Q: The command source framework-configure.sh hadoop $HADOOP_ROOT_DIR/etc/hadoop gives the output bash: framework-configure.sh: No such file or directory. How can this be resolved?
A: Please try to re-submit or re-run the job, and if that does not help, re-login to the ZIH system.
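As a quick check (a suggestion, not part of the official workflow), you can verify that the corresponding module is loaded and that the script is found in your PATH before sourcing it:
marie@compute$ module list
marie@compute$ which framework-configure.sh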
Q: There are a lot of errors and warnings during the setup of the session.
A: Please check that the setup works by running a simple example as shown in this documentation.
Help
If you have questions or need advice, please use the contact form on https://scads.ai/about-us/ or contact the HPC support.