Published in Dev Genius. Why to avoid multiple chaining of withColumn() function in Spark job. Are you chaining multiple withColumn() calls in your Spark job? Let’s take a deep dive to understand the implications and how we can avoid them. Oct 8, 2024
Published in Towards Dev. Apache Spark : How to solve Reverse Mapping Problem in Spark. Problem statement: Let’s say we have a dataset of PairRDD[K, V] where K represents the Student Id and V represents the course names to… Dec 31, 2021
Apache Spark: aggregateByKey vs combineByKey. In this article, we will first learn about aggregateByKey in Apache Spark, and in the next article (to be published later, as both topics are… Dec 27, 2021
Java 8: Intersection Types, Lambda Serialization & java.io.NotSerializableException. This blog has been split into 2 series: Mar 6, 2021
Apache Spark: mapPartitions implementation in Spark in Java. In this blog, we will look at the use case of mapPartitions and its implementation in Spark's Java API. Before going forward, please… Feb 27, 2021
Published in Analytics Vidhya. Apache Spark : Secondary Sorting in Spark in Java. We have all probably seen secondary sorting in MapReduce/Hadoop and its key implementation. There is plenty of information and many blogs available… Feb 18, 2021
Apache Spark : Given a list of user’s comments, determine the latest and last time a user commented. For the above problem statement, we will use the Stack Overflow dataset and its comments.xml file. You can download the sample dataset… Feb 7, 2021
Apache Spark : DNA Base Count Problem :: How to count each frequencies of A, T, C, G, and N (the… What is the DNA Base Count Problem, and how can Spark help us compute the counts/frequencies? To solve this problem, I will be writing Spark… Feb 6, 2021