sparklyr 1.7 is now available on CRAN!
To install sparklyr 1.7 from CRAN, run install.packages("sparklyr").
In this blog post, we wish to present the following highlights from the sparklyr 1.7 release:
- Image and binary data sources
- New spark_apply() capabilities
- Better integration with sparklyr extensions
- Other exciting news
Image and binary data sources
As a unified analytics engine for large-scale data processing, Apache Spark
is well-known for its ability to tackle challenges associated with the volume, velocity, and last but
not least, the variety of big data. Therefore it is hardly surprising to see that – in response to recent
advances in deep learning frameworks – Apache Spark has introduced built-in support for
image data sources
and binary data sources
(in releases 2.4 and 3.0, respectively).
The corresponding R interfaces for both data sources, namely,
spark_read_image()
and
spark_read_binary()
, were shipped
recently as part of sparklyr 1.7.
The usefulness of data source functionalities such as spark_read_image() is perhaps best illustrated
by the quick demo below, where spark_read_image(), through the standard Apache Spark
ImageSchema,
helps connect raw image inputs to a sophisticated feature extractor and a classifier, forming a powerful
Spark application for image classification.
The demo
In this demo, we shall construct a scalable Spark ML pipeline capable of classifying images of cats and dogs
accurately and efficiently, using spark_read_image() and a pre-trained convolutional neural network
code-named Inception (Szegedy et al. (2015)).
The first step to building such a demo with maximum portability and repeatability is to create a sparklyr extension that accomplishes the following:
- Specifying the required Maven dependencies of this demo (namely, the Spark Deep Learning library (Databricks, Inc. (2019)), which contains an Inception-V3-based image feature extractor accessible through the Spark ML Transformer interface)
- Bundling with itself two randomly selected [1] and disjoint subsets of the dogs-vs-cats dataset (Elson et al. (2007)) as train and test data, stored in the extdata/{train,test}sub directories of the package
A reference implementation of such a sparklyr extension can be found here.
The second step, of course, is to make use of the above-mentioned sparklyr extension to perform some feature
engineering. We will see very high-level features being extracted intelligently from each cat/dog image based
on what the pre-built Inception-V3 convolutional neural network has already learned from classifying a much
broader collection of images.
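A hedged sketch of this feature-engineering step is shown below. The data source call uses spark_read_image() from sparklyr 1.7; the featurizer class name and its setters are taken from the Spark Deep Learning library and are assumptions here, as is the image directory path.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Read raw training images into a Spark dataframe, using the
# standard Apache Spark ImageSchema under the hood
images <- spark_read_image(sc, name = "train_images", dir = "extdata/trainsub")

# Assumed class/setter names from the Spark Deep Learning library:
# an Inception-V3-based image feature extractor exposed as a
# Spark ML Transformer
featurizer <- invoke_new(sc, "com.databricks.sparkdl.DeepImageFeaturizer") %>%
  invoke("setModelName", "InceptionV3") %>%
  invoke("setInputCol", "image") %>%
  invoke("setOutputCol", "features")

# Apply the featurizer to the images and register the result
# as a sparklyr table for downstream steps
features_tbl <- featurizer %>%
  invoke("transform", spark_dataframe(images)) %>%
  sdf_register()
```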
Third step: equipped with features that summarize the content of each image well, we can build a Spark ML pipeline that recognizes cats and dogs using only logistic regression [2].
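A minimal sketch of such a pipeline, assuming `features_tbl` (with `features` and `label` columns) was produced by the preceding feature-extraction step:

```r
library(sparklyr)

# A Spark ML pipeline with a single logistic regression stage
# operating on the pre-extracted Inception-V3 features
pipeline <- ml_pipeline(sc) %>%
  ml_logistic_regression(features_col = "features", label_col = "label")

# Fit the pipeline on the training features
model <- ml_fit(pipeline, features_tbl)
```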
Finally, we can evaluate the accuracy of this model on the test images:
## Predictions vs. labels:
## # Source: spark<?> [?? x 2]
## label prediction
## <int> <dbl>
## 1 1 1
## 2 1 1
## 3 1 1
## 4 1 1
## 5 1 1
## 6 1 1
## 7 1 1
## 8 1 1
## 9 1 1
## 10 1 1
## 11 0 0
## 12 0 0
## 13 0 0
## 14 0 0
## 15 0 0
## 16 0 0
## 17 0 0
## 18 0 0
## 19 0 0
## 20 0 0
##
## Accuracy of predictions:
## [1] 1
New spark_apply() capabilities
Optimizations & custom serializers
Many sparklyr users who have tried to run
spark_apply()
or
doSpark
to
parallelize R computations among Spark workers have probably encountered some
challenges arising from the serialization of R closures.
In some scenarios, the
serialized size of the R closure can become too large, often due to the size
of the enclosing R environment required by the closure. In other
scenarios, the serialization itself may take too much time, partially offsetting
the performance gain from parallelization. Recently, multiple optimizations went
into sparklyr to address those challenges. One of the optimizations was to
make good use of the
broadcast variable
construct in Apache Spark to reduce the overhead of distributing shared and
immutable task states across all Spark workers. In sparklyr 1.7, there is
also support for custom spark_apply() serializers, which offers more fine-grained
control over the trade-off between speed and compression level of serialization
algorithms. For example, one can specify a serializer that applies the default options of
qs::qserialize() to achieve a high compression level, or one that aims for faster
serialization speed with less compression.
Inferring dependencies automatically
In sparklyr 1.7, spark_apply() also provides the experimental
auto_deps = TRUE option. With auto_deps enabled, spark_apply() will
examine the R closure being applied, infer the list of required R packages,
and only copy the required R packages and their transitive dependencies
to Spark workers. In many scenarios, the auto_deps = TRUE option will be a
significantly better alternative compared to the default packages = TRUE
behavior, which is to ship everything within .libPaths() to Spark worker
nodes, or the advanced packages = <package config> option, which requires
users to supply the list of required R packages or manually create a
spark_apply() bundle.
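As a brief sketch (the closure and its package dependency are illustrative), auto_deps lets sparklyr ship only what the closure actually uses:

```r
library(sparklyr)

sc <- spark_connect(master = "local")
sdf <- sdf_len(sc, 10)

# With auto_deps = TRUE, spark_apply() inspects the closure below,
# infers that it needs the `stringi` package, and copies only `stringi`
# and its transitive dependencies to the Spark workers -- instead of
# shipping everything under .libPaths()
result <- spark_apply(
  sdf,
  function(df) {
    df$padded <- stringi::stri_pad_left(as.character(df$id), width = 4, pad = "0")
    df
  },
  auto_deps = TRUE
)
```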
Better integration with sparklyr extensions
Substantial effort went into sparklyr 1.7 to make life easier for sparklyr
extension authors. Experience suggests two areas where integrating with sparklyr
has historically been a frictional and non-straightforward path for any extension:
- customizing sparklyr's dbplyr SQL translations
- invoking Java/Scala functions from R
We will elaborate on recent progress in both areas in the sub-sections below.
Customizing the dbplyr SQL translation environment
sparklyr extensions can now customize sparklyr’s dbplyr SQL translations
through the
spark_dependency()
specification returned from spark_dependencies() callbacks.
This type of flexibility becomes useful, for instance, in scenarios where a
sparklyr extension needs to insert type casts for inputs to custom Spark
UDFs. We can find a concrete example of this in
sparklyr.sedona
,
a sparklyr extension to facilitate geo-spatial analyses using
Apache Sedona
. Geo-spatial UDFs supported by Apache
Sedona such as ST_Point() and ST_PolygonFromEnvelope() require all inputs to be
DECIMAL(24, 20) quantities rather than DOUBLEs. Without any customization to
sparklyr’s dbplyr SQL variant, the only way for a dplyr
query involving ST_Point() to actually work in sparklyr would be to explicitly
implement any type cast needed by the query using dplyr::sql().
This would, to some extent, be antithetical to dplyr’s goal of freeing R users from
laboriously spelling out SQL queries. By customizing sparklyr's dbplyr SQL
translations instead (as implemented here and here), sparklyr.sedona allows users to write
the same dplyr query without any manual casts, and the required Spark SQL type casts are
generated automatically.
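As a hedged illustration (the table `pts_sdf` and its `x`/`y` columns are hypothetical), the manual-cast version versus the version enabled by sparklyr.sedona's customized translations might look like this:

```r
library(sparklyr)
library(dplyr)

# Without customized SQL translations: spell out each
# DECIMAL(24, 20) cast by hand through dplyr::sql()
pts_manual <- pts_sdf %>%
  mutate(
    x = sql("CAST(`x` AS DECIMAL(24, 20))"),
    y = sql("CAST(`y` AS DECIMAL(24, 20))")
  ) %>%
  mutate(pt = ST_Point(x, y))

# With sparklyr.sedona loaded, its customized dbplyr translations
# generate the required casts automatically
library(sparklyr.sedona)
pts <- pts_sdf %>% mutate(pt = ST_Point(x, y))
```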
Improved interface for invoking Java/Scala functions
In sparklyr 1.7, the R interface for Java/Scala invocations saw a number of
improvements.
With previous versions of sparklyr, many sparklyr extension authors would
run into trouble when attempting to invoke Java/Scala functions accepting an
Array[T] as one of their parameters, where T is any type bound more specific
than java.lang.Object / AnyRef. This was because any array of objects passed
through sparklyr's Java/Scala invocation interface would be interpreted as simply
an array of java.lang.Objects in the absence of additional type information.
For this reason, a helper function
jarray()
was implemented as
part of sparklyr 1.7 as a way to overcome the aforementioned problem.
For example, a jarray() call can assign to arr a reference to an Array[MyClass] of length 5, rather
than an Array[AnyRef]. Subsequently, arr becomes suitable to be passed as a
parameter to functions accepting only Array[MyClass] values as inputs. Previously,
some possible workarounds of this sparklyr limitation included changing
function signatures to accept Array[AnyRef]s instead of Array[MyClass]s, or
implementing a “wrapped” version of each function accepting Array[AnyRef]
inputs and converting them to Array[MyClass] before the actual invocation.
None of these workarounds was an ideal solution to the problem.
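A sketch of such a jarray() call. MyClass is a stand-in for any user-defined JVM class, and the exact jarray() signature should be checked against the sparklyr reference:

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# Construct 5 MyClass instances on the JVM side
# (MyClass is hypothetical)
elems <- lapply(seq(5), function(x) invoke_new(sc, "MyClass", x))

# Wrap them as an Array[MyClass] rather than an Array[AnyRef],
# so they can be passed to methods expecting Array[MyClass]
arr <- jarray(sc, elems, element_type = "MyClass")
```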
Another similar hurdle that was addressed in sparklyr 1.7 as well involves
function parameters that must be single-precision floating point numbers or
arrays of single-precision floating point numbers.
For those scenarios,
jfloat()
and
jfloat_array()
are the helper functions that allow numeric quantities in R to be passed to
sparklyr’s Java/Scala invocation interface as parameters with desired types.
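For illustration, again as a sketch (the `MyUtils` class and its methods are hypothetical; consult the sparklyr reference for exact signatures):

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# Wrap R numerics as single-precision Float / Array[Float] values
f  <- jfloat(sc, 1.23)
fa <- jfloat_array(sc, c(-1.23, 4.56))

# Pass them to JVM methods whose parameters require single precision
# (MyUtils and its methods are hypothetical)
invoke_static(sc, "MyUtils", "processFloat", f)
invoke_static(sc, "MyUtils", "processFloats", fa)
```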
In addition, while previous versions of sparklyr failed to serialize
parameters with NaN values correctly, sparklyr 1.7 preserves NaNs as
expected in its Java/Scala invocation interface.
Other exciting news
There are numerous other new features, enhancements, and bug fixes made to
sparklyr 1.7, all listed in the
NEWS.md
file of the sparklyr repo and documented in sparklyr’s
HTML reference
pages.
In the interest of brevity, we will not describe all of them in great detail
within this blog post.
Acknowledgement
In chronological order, we would like to thank the following individuals who
have authored or co-authored pull requests that were part of the sparklyr 1.7
release:
We’re also extremely grateful to everyone who has submitted
feature requests or bug reports, many of which have been tremendously helpful in
shaping sparklyr into what it is today.
Furthermore, the author of this blog post is indebted to @skeydan for her awesome editorial suggestions. Without her insights about good writing and story-telling, expositions like this one would have been less readable.
If you wish to learn more about sparklyr, we recommend visiting
sparklyr.ai and spark.rstudio.com,
and also reading some previous sparklyr release posts such as
sparklyr 1.6 and sparklyr 1.5.
That is all. Thanks for reading!
Databricks, Inc. 2019. Deep Learning Pipelines for Apache Spark. V. 1.5.0. Released January 25. https://spark-packages.org/package/databricks/spark-deep-learning.
Elson, Jeremy, John (JD) Douceur, Jon Howell, and Jared Saul. 2007. “Asirra: A CAPTCHA That Exploits Interest-Aligned Manual Image Categorization.” Proceedings of the 14th ACM Conference on Computer and Communications Security (CCS). https://www.microsoft.com/en-us/research/publication/asirra-a-captcha-that-exploits-interest-aligned-manual-image-categorization/.
Szegedy, Christian, Wei Liu, Yangqing Jia, et al. 2015. “Going Deeper with Convolutions.” Computer Vision and Pattern Recognition (CVPR). http://arxiv.org/abs/1409.4842.
[1] Fun exercise for our readers: why not experiment with different subsets of cats-vs-dogs images for training and testing, or even better, replace train and test images with your own images of cats and dogs, and see what happens?
[2] Another way to see why it works: in fact, the pre-built Inception-based feature extractor simply applies all transformations Inception would have applied to its input, except for the last logistic-regression-esque affine transformation plus non-linearity producing the final categorical output, and Inception is a highly successful convolutional neural network trained to recognize 1000 categories of animals and objects, including multiple types of cats and dogs.
