The release of Spark 3.0, news about Databricks and Delta Lake, and much more! Takeaways from Spark + AI Summit 2019.
It’s been almost a month since Spark + AI Summit 2019 in San Francisco, and I had a fantastic experience: so many interesting talks and, most importantly, some exciting announcements that I’m here to share with you.
Yes, Spark 2.4 is not enough: now we have a new version to play around with, and one of the coolest features announced alongside it is Koalas.
Koalas is a library that pretty much replicates every single method from Pandas, but with the ability to run over Spark clusters. This is really interesting because all the pipelines that you write and test with Pandas dataframes can now be reused in Spark by changing just a couple of lines of code:
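The snippet below is a minimal sketch of the idea (the file name and the user_id column are placeholders of mine, not from the announcement): moving from Pandas to Koalas is mostly a matter of swapping the import and the call that builds the dataframe.

```python
# Pandas version: runs on a single machine.
import pandas as pd

pdf = pd.read_csv("events.csv")
print(pdf.groupby("user_id").count())

# Koalas version: same API, but executed on a Spark cluster.
import databricks.koalas as ks

kdf = ks.read_csv("events.csv")
print(kdf.groupby("user_id").count())
```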
Now you can prototype locally while staying fully scalable. Amazing…
I got to know Delta Lake in detail at this summit (many talks and tutorials explained the framework thoroughly) and it really got me engaged. Delta Lake is a framework that works on top of your Parquet data lake and provides a set of amazing properties:
Time travel, in particular, is very useful for data science purposes. Why? Because you can train your models on previous data, test them on current data, and once you are confident enough, move the window forward.
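For example, Delta Lake’s time travel lets you pin a read to an older version of a table. A minimal sketch, assuming the Delta Lake package is available in the Spark session and using a made-up path and date:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Train on the table as it looked at some point in the past...
train_df = (spark.read.format("delta")
            .option("timestampAsOf", "2019-04-01")   # placeholder date
            .load("/datalake/events"))                # placeholder path

# ...and test on the current state of the same table.
test_df = spark.read.format("delta").load("/datalake/events")
```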
Delta Lake has been around for almost a year now and has gained quite a bit of robustness. However, until this summit it was proprietary, and only people running Databricks on the cloud and Databricks clients had access to it. Now they have open-sourced it to the world, which has two major benefits:
I will talk about three of the talks, which I found relevant to what we do on a daily basis at Retargetly.
The talk I liked the most was delivered by Andrew Clegg.
He talked about efficient joins in Spark. Whoever has worked with Spark for a while knows how painful joins can be in terms of execution time and resources. These problems usually show up when you are dealing with skewed data, meaning you have some keys with a ton of rows while most of the remaining ones have just a few.
A good way to pinpoint this problem is to look at the Spark UI and check the distribution of task times.
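If you prefer to check this programmatically instead of eyeballing task durations, a quick way (sketched here with a placeholder dataframe df and join column key) is to count rows per key and look at the largest groups:

```python
import pyspark.sql.functions as F

# Rows per key, largest first: a handful of huge counts means the join will skew.
(df.groupBy("key")
   .count()
   .orderBy(F.col("count").desc())
   .show(20))
```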
Andrew proposed a way to deal with these situations fairly easily:
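Roughly, the trick is to “salt” the skewed key on one side and replicate the other side once per salt value, so each hot key gets spread across many partitions. A minimal sketch under my own assumptions (two dataframes d1 and d2 standing in for D1 and D2, joined on a column simply called key, with D2 being the replicated side):

```python
import pyspark.sql.functions as F

def salted_join(d1, d2, key, buckets=32):
    """Spread a skewed join key across `buckets` partitions.

    d1 is the large, skewed side: each row gets a random salt.
    d2 is the other side: each row is replicated once per salt value.
    """
    d1_salted = d1.withColumn("salt", (F.rand() * buckets).cast("int"))
    d2_exploded = d2.withColumn(
        "salt", F.explode(F.array(*[F.lit(i) for i in range(buckets)]))
    )
    # Joining on (key, salt) splits each hot key into `buckets` smaller tasks.
    return d1_salted.join(d2_exploded, on=[key, "salt"]).drop("salt")
```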
Isn’t it awesome??
However, this might introduce many new rows for D2, which is kind of undesirable. Thus, instead of doing it for the complete dataframe, we can apply the salting only to the keys that we know are skewed, perform a regular join for the rest of the keys, and then union the two results, as sketched below. Still easy.
I loved the approach overall, and it’s extensible to stranger use cases. It can also be tweaked for groupBys (the other operation that causes many problems).
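In code, that refinement could look roughly like this (again a sketch: skewed_keys is a small list of hot keys you have already identified, and salted_join is the helper from the previous snippet):

```python
import pyspark.sql.functions as F

# skewed_keys: a small list of hot keys spotted in the Spark UI (assumption).
hot = F.col("key").isin(skewed_keys)

hot_joined  = salted_join(d1.filter(hot), d2.filter(hot), key="key")
rest_joined = d1.filter(~hot).join(d2.filter(~hot), on="key")

result = hot_joined.unionByName(rest_joined)
```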
The second very interesting talk was given by Beck Cronin-Dixon, data engineer at Eventbrite.
She explained how to build basic but highly performant ETLs depending on the use case, and suggested four different ingestion approaches for near real-time analytics, each with its pros and cons:
Good: latency in minutes; ingestion is easy to implement; it simplifies the data lake.
Bad: requires a compaction process and extra logic in the queries.
Bad: batch writes to a key-value store are slower than writes to HDFS, and it is not optimized for large scans.
Good: very fast and relatively easy to implement.
Bad: the streaming merge is complicated, and processing is pushed to the read side.
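To make the “processing on the read side” part a bit more concrete, here is a sketch of how such a read-side merge might look. This is purely my own assumption of the pattern, with made-up paths and column names, not the pipeline shown in the talk: change events are appended continuously, and each query keeps only the latest event per key.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# Change events appended continuously by a streaming job (not shown here);
# the path and the column names are placeholders.
changes = spark.read.parquet("/lake/orders_changes")

# Read-side merge: keep only the newest event per business key.
w = Window.partitionBy("order_id").orderBy(F.col("updated_at").desc())
latest = (changes
          .withColumn("rn", F.row_number().over(w))
          .filter(F.col("rn") == 1)
          .drop("rn"))
```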
At Retargetly we have dealt with a diverse range of pipelines, and all of them have different requirements. The hybrid approach is something we are definitely going to test in the short term, since it looks very promising.
Overall, this summit tells us that we have made the right choice, since Spark keeps being one of the most widely adopted technologies in the industry.
We started working with Spark a couple of months ago and it has been quite tough, I should say, but its benefits are astonishing and coding in the map-reduce paradigm is a ton of fun!
In upcoming posts, we will talk about the structure of our data stack and how we configure our pipelines. Stay tuned!
By Mathias Longo, Chief Data Scientist at Retargetly