Installing SparkR on a Hadoop Cluster

Purpose

SparkR is an extension to Apache Spark that lets you run Spark jobs from the R programming language, which means you can use R packages and libraries in your Spark jobs. Neither Cloudera nor MapR supports SparkR out of the box, so it needs to be installed separately.

Installation Steps

Here are the steps you can take to Install SparkR on a Hadoop Cluster:

  1. Execute the following steps on all the Spark Gateways/Edge Nodes
    1. Log in to the target machine as root
    2. Install R and other Dependencies
      1. Execute the following to Install
        1. Ubuntu/Debian (adjust the CRAN repository line below to match your distribution and release)
          sh -c 'echo "deb http://cran.rstudio.com/bin/linux/debian lenny-cran/" >> /etc/apt/sources.list'
          apt-get update
          apt-get install r-base r-base-dev
        2. CentOS
          1. Install the repo
            rpm -ivh http://mirror.unl.edu/epel/6/x86_64/epel-release-6-8.noarch.rpm
          2. Enable the repo
            1. Edit the /etc/yum.repos.d/epel-testing.repo file with your favorite text editing software
            2. Set enabled=1 in each section of the file
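              For example, one quick way to flip the flag (assuming the repo file is at the default EPEL location) is:
              sed -i 's/enabled=0/enabled=1/g' /etc/yum.repos.d/epel-testing.repo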
          3. Clean yum cache
            yum clean all
          4. Install R and Dependencies
            yum install R R-devel libcurl-devel openssl-devel
      2. Test R installation
        1. Start up an R Session
          R
        2. Within the R Shell, run a simple addition to make sure commands execute correctly
          1 + 1
        3. Quit when you’re done
          quit()
      3. Note: R libraries get installed under “/usr/lib64/R”
    3. Get the version of Spark you currently have installed
      1. Run the following command
        spark-submit --version
      2. Example output: 1.6.0
      3. Use this value wherever the Placeholder {SPARK_VERSION} appears below
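      4. Optionally, capture the version into a shell variable for the shell steps that follow (a convenience sketch; the R commands later still need the value substituted by hand, and you may need to adjust the parsing if your banner output differs)
        SPARK_VERSION=$(spark-submit --version 2>&1 | grep -m1 -o 'version [0-9.]*' | awk '{print $2}')
        echo ${SPARK_VERSION}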
    4. Install SparkR
      1. Start up the R console
        R
      2. Install the required R packages
        install.packages("devtools")
        install.packages("roxygen2")
        install.packages("testthat")
      3. Install the SparkR Packages
        devtools::install_github('apache/spark@v{SPARK_VERSION}', subdir='R/pkg')
        install.packages('sparklyr')
      4. Close out of the R shell
        quit()
    5. Find the Spark Home Directory and use it wherever the Placeholder {SPARK_HOME_DIRECTORY} appears below
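      Common locations are /opt/cloudera/parcels/CDH/lib/spark (Cloudera parcels), /usr/lib/spark (package-based installs), and /opt/mapr/spark/spark-{SPARK_VERSION} (MapR); one way to check where the spark-submit on your PATH points is:
      ls -l "$(which spark-submit)"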
    6. Copy the SparkR sources from the Spark release into the Spark home directory
      cd /tmp/
      wget https://github.com/apache/spark/archive/v{SPARK_VERSION}.zip
      unzip v{SPARK_VERSION}.zip
      cd spark-{SPARK_VERSION}
      cp -r R {SPARK_HOME_DIRECTORY}
      cd bin
      cp sparkR {SPARK_HOME_DIRECTORY}/bin/
    7. Run Dev Install
      cd {SPARK_HOME_DIRECTORY}/R/
      sh install-dev.sh
    8. Create a new file “/usr/bin/sparkR” and set its contents
      1. Copy the contents of the /usr/bin/spark-shell file to /usr/bin/sparkR
        cp /usr/bin/spark-shell /usr/bin/sparkR
      2. Edit the /usr/bin/sparkR file and replace “spark-shell” with “sparkR” in the exec command at the bottom of the file.
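        For example, assuming “spark-shell” only appears in that final exec line, the swap can be done with:
        sed -i 's/spark-shell/sparkR/g' /usr/bin/sparkR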
    9. Make the new script executable
      sudo chmod 755 /usr/bin/sparkR
    10. Verify that the sparkR command is available
      cd ~
      which sparkR
    11. You’re done!

Testing

Once the installation steps are complete, here are some ways to test the installation and verify that everything is working correctly.

  • Test from R Console – Run on a Spark Gateway
    1. Start an R Shell
      R
    2. Execute the following commands in the R Shell
      library(SparkR)
      library(sparklyr)
      Sys.setenv(SPARK_HOME='{SPARK_HOME_DIRECTORY}')
      Sys.setenv(SPARK_HOME_VERSION='{SPARK_VERSION}')
      Sys.setenv(YARN_CONF_DIR='{YARN_CONF_DIRECTORY}')
      sc = spark_connect(master = "yarn-client")
    3. If this runs without errors then you know it’s working!
  • Test from SparkR Console – Run on a Spark Gateway
    1. Open the SparkR Console
      sparkR
    2. Verify the Spark Context is available with the following command:
      sc
    3. If the sc variable is listed then you know it’s working!
  • Sample code you can run in the SparkR console to test further
    rdd = SparkR:::parallelize(sc, 1:5)
    SparkR:::collect(rdd)
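  • Optionally, a further sparklyr check from a plain R session (a sketch; set the same environment variables as in the first test before connecting; it copies the built-in iris data frame into Spark and reads back the first rows)
    library(sparklyr)
    library(dplyr)
    sc <- spark_connect(master = "yarn-client")
    iris_tbl <- copy_to(sc, iris, "iris_spark")
    head(iris_tbl)
    spark_disconnect(sc)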