File this post under “late to the game”, but I just completed a project where I used Apache Spark for the first time and I’m blown away. Here’s my experience.

No Cluster Needed

Perhaps it was my bias from working with Apache Impala a few years back, but I just assumed that Spark would need Hadoop set up on a cluster of servers. I didn't want to spend my time standing all that up just to play around with Spark, so I never bothered with it before.

Turns out, Spark has a rather robust single-machine mode. Even better, there's an R package that took care of all the setup for me.

sparklyr Makes Spark Simple

The R package sparklyr made my foray into Spark dead simple. It happily installed Spark for me and provided functions to start and stop a Spark instance from within my R scripts.
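A minimal sketch of that workflow looks like this (the Spark version shown is illustrative; pick whatever sparklyr supports on your machine):

```r
library(sparklyr)

# One-time: download and install a local copy of Spark.
# The version number here is just an example.
spark_install(version = "3.4")

# Start a local, single-machine Spark instance and connect to it.
sc <- spark_connect(master = "local")

# ... load tables and do work against sc ...

# Shut the local instance down when finished.
spark_disconnect(sc)
```

No Hadoop cluster involved: `spark_connect(master = "local")` runs everything in a single local process.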

Pro tip: by default sparklyr limits Spark to a single core when it starts up a local instance. Switching to multiple cores is easy, and it makes a world of difference in performance.
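Here's one way to do that, sketched under the assumption that you're running in local mode (the core counts are illustrative, and it's worth double-checking option names against the sparklyr configuration docs):

```r
library(sparklyr)

# Spark's "local" master string controls parallelism directly:
# "local" is single-threaded, "local[4]" requests four cores,
# and "local[*]" requests every available core.
sc <- spark_connect(master = "local[*]")

# Alternatively, pass a config object. sparklyr.cores.local
# sets the number of cores used in local mode.
conf <- spark_config()
conf$sparklyr.cores.local <- 4
sc <- spark_connect(master = "local", config = conf)
```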

dplyr and Spark Is a Powerful Combination

sparklyr gave me access to the tables I loaded into Spark, and dplyr, through its dbplyr backend, let me manipulate and query those tables.

dplyr is amazing. Rather than hand-writing Spark SQL, I used its familiar verbs to join tables, add WHERE clauses, and choose the columns returned from Spark.
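To make that concrete, here's a hedged sketch of the pattern: the table names and the small lookup table are made up for illustration, but the verbs shown (`filter`, `left_join`, `select`) are the standard dplyr ones that dbplyr translates into Spark SQL:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Copy a built-in data frame into Spark; the returned object is a
# remote table reference, not the data itself.
cars_tbl <- copy_to(sc, mtcars, "cars")

# A small lookup table, just to demonstrate a join.
labels_tbl <- copy_to(
  sc,
  data.frame(cyl = c(4, 6, 8),
             size = c("small", "medium", "large")),
  "cyl_labels"
)

# These verbs are translated to Spark SQL and executed inside Spark.
result <- cars_tbl %>%
  filter(mpg > 20) %>%                  # becomes a WHERE clause
  left_join(labels_tbl, by = "cyl") %>% # becomes a LEFT JOIN
  select(mpg, cyl, size)                # trims the returned columns

# Inspect the SQL dbplyr generated, then pull results back into R.
show_query(result)
collect(result)

spark_disconnect(sc)
```

Nothing is computed until you call `collect()` (or otherwise force the query), so you can build pipelines incrementally and let Spark do the heavy lifting.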

Great Performance

My project was to explore replacing an existing part of our data pipeline. Using Spark, our processing time went from days to hours.

More Spark in the Future

After this successful venture into Spark territory, I’m pretty sure I’ll be employing Spark in future projects.