spark

module (bold = default)   build    dependencies
spark/1.0.2               x86_64   java/1.8.0
spark/1.1.0               x86_64   java/1.8.0
spark/1.3.0               x86_64   java/1.8.0 python/2.7.10/b1
spark/1.6.1               x86_64   java/1.8.0 python/2.7.10/b1
spark/2.0.0/b1            x86_64   java/1.8.0 python/2.7.10/b1
spark/2.1.0/b1            x86_64   java/1.8.0

Trying out Spark interactively

Use the interactive script to get a job with however many nodes and cores you need, for example

  interactive -p debug -t 60 -N 2 --ntasks-per-node=22

You can then create a Spark cluster within your Slurm allocation by running

  module load spark
  source start-spark

You can then start a Scala shell that connects to your Spark cluster by running

  spark-shell

or a Python shell by running

  pyspark
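
For example, once the shell starts you can run a small computation against the cluster using the SparkContext that pyspark creates for you as sc (a minimal sanity check, not a required step):

  # `sc` is created by the pyspark shell itself; a quick check that the cluster responds
  sc.parallelize(range(100)).sum()   # sums 0..99 across the workers (result: 4950)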

Alternatively, if you launch a Jupyter notebook using

  jupyter notebook --ip=`hostname`.bluehive.circ.private --port=8888

you can connect to the Spark context from within the notebook using

import os
import sys

# locate the Spark installation set up by `module load spark`
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, os.path.join(spark_home, 'python'))
# the py4j version in this path depends on the Spark module; check $SPARK_HOME/python/lib
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.1-src.zip'))
# run Spark's interactive-shell bootstrap, which creates the SparkContext as sc
# (execfile is Python 2; under Python 3 use exec(open(path).read()) instead)
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))

You can then connect to your Jupyter notebook using Firefox on X2Go, or you can tunnel the connection back to your workstation by following the instructions for Jupyter.
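
Once the bootstrap above has run in a notebook cell, the SparkContext is available as sc just as in the pyspark shell. As a slightly larger illustration (the input path below is only a placeholder; substitute a file that every node can read), a word count looks like:

# hypothetical example -- replace the path with a real file visible to every node
lines = sc.textFile('/scratch/your_netid/sample.txt')
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.take(10))   # first ten (word, count) pairs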

Starting a Spark cluster inside of Slurm

Below is a sample batch script that creates a Spark cluster using the nodes in your job.

#!/bin/bash
#SBATCH -N 2                    # number of nodes
#SBATCH --ntasks-per-node=1     # one task per node
#SBATCH --cpus-per-task=24      # cores per task
#SBATCH -p debug                # partition
#SBATCH -t 4                    # time limit (minutes)

. /software/modules/init/sh     # make the module command available in the batch shell
module load spark/1.3.0
source start-spark
spark-submit --name "My app" --class SimpleCounter myfile.jar  # for Scala or Java
spark-submit --name "My app" myfile.py                         # for Python
stop-spark
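
The myfile.py in the last spark-submit line stands in for your own standalone PySpark application. A minimal sketch of such a file (hypothetical, written against the SparkContext API of the Spark 1.x module loaded above) might look like:

# myfile.py -- hypothetical minimal PySpark application for the spark-submit line above
from pyspark import SparkConf, SparkContext

# the application name is supplied by --name on the spark-submit command line
sc = SparkContext(conf=SparkConf())

# trivial distributed computation: count the even numbers in 0..999
evens = sc.parallelize(range(1000)).filter(lambda n: n % 2 == 0).count()
print("even numbers: %d" % evens)

sc.stop()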

While your job is running, you can view the cluster status by pointing a web browser running on BlueHive (i.e., through X2Go) at port 8080 on the master node.

Also see Using Software.