Thursday 4 August 2016

Spark Questions

Q.1 Difference between reduceByKey, groupByKey, aggregateByKey and combineByKey

Ans . 

reduceByKey() : 
      1. Merges the values for each key using an associative reduce function.

      2. It also performs the merging locally on each mapper before
         sending results to a reducer, similar to a "combiner" in MapReduce.

      3. The output is partitioned into numPartitions partitions, or at the
         default parallelism level if numPartitions is not specified.
         The default partitioner is hash partitioning.
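
A minimal PySpark sketch with made-up data; the numPartitions argument is optional:

     rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
     sorted(rdd.reduceByKey(lambda a, b: a + b, numPartitions=2).collect())
     [('a', 2), ('b', 1)]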

aggregateByKey() :

aggregateByKey() combines the values for a particular key, and the result of that combination can be of any type you specify. You have to specify how the values are combined ("added") inside one partition (which is executed on a single node) and how the results from different partitions (which may be on different nodes) are combined.

reduceByKey is a particular case, in the sense that the result of the combination (e.g. a sum) is of the same type as the values, and that the operation for combining results from different partitions is the same as the operation for combining values inside a partition.

The reason reduceByKey() is so much more efficient than groupByKey() is that it makes use of a MapReduce feature called a combiner. Any associative and commutative function like + or * can be used this way, because the order in which the elements are combined does not matter. This allows Spark to start "reducing" values with the same key even before they are all in the same partition.




The aggregateByKey function requires 3 parameters:

1. An initial ‘zero’ value that will not affect the total values to be collected. For example, if we were adding numbers the initial value would be 0; in the case of collecting unique elements per key, the initial value would be an empty set.

2. A combining function accepting two parameters. The second parameter is merged into the first. This function combines/merges values within a partition. It can return a result type, U, that is different from the type of the values in this RDD, V.

3. A merging function accepting two parameters. In this case the parameters are merged into one. This step merges values across partitions.


To avoid memory allocation, both of these functions are allowed to modify and return their first argument instead of creating a new U.
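
A minimal sketch that collects the distinct values per key into a set, using made-up sample data (the zero value is an empty set):

     rdd = sc.parallelize([("a", 1), ("a", 2), ("a", 1), ("b", 3)])
     result = rdd.aggregateByKey(set(),
                                 lambda acc, v: acc | {v},        # merge a value into the set within a partition
                                 lambda acc1, acc2: acc1 | acc2)  # merge the sets across partitions
     sorted(result.mapValues(sorted).collect())
     [('a', [1, 2]), ('b', [3])]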


groupByKey() :

groupByKey() gives you more versatility, since after grouping you can apply any function to the Iterable of values for each key, meaning you could even pull all the elements into an array. However, it is inefficient, because all of the (K, V) pairs for a key have to be shuffled to a single partition and held in memory; no map-side combining takes place.

combineByKey() is the most general of the four: you supply a createCombiner, a mergeValue and a mergeCombiners function, and both reduceByKey() and aggregateByKey() are implemented on top of it.
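
A minimal sketch; mapValues(list) just materialises the grouped Iterable so it can be printed:

     rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
     sorted(rdd.groupByKey().mapValues(list).collect())
     [('a', [1, 1]), ('b', [1])]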


Q .2 Difference between join and union

Union :     

Return the union of this RDD and another one.
rdd = sc.parallelize([1, 1, 2, 3])
rdd.union(rdd).collect()
[1, 1, 2, 3, 1, 1, 2, 3]

Join :
join() is defined on pair RDDs (RDDs of key-value pairs). It returns an RDD containing all pairs of elements with matching keys, as (k, (v1, v2)) tuples; union(), in contrast, works on any two RDDs of the same type and simply concatenates them without removing duplicates.
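
A minimal sketch of join on two pair RDDs:

     x = sc.parallelize([("a", 1), ("b", 4)])
     y = sc.parallelize([("a", 2), ("a", 3)])
     sorted(x.join(y).collect())
     [('a', (1, 2)), ('a', (1, 3))]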


Q .3 Difference between map(), mapPartitions() and mapPartitionsWithIndex() ?

Ans .
     map() :
     Returns a new RDD by applying a function to each element of this RDD.
     example :
     rdd = sc.parallelize(["b", "a", "c"])
     sorted(rdd.map(lambda x: (x, 1)).collect())
      [('a', 1), ('b', 1), ('c', 1)]

     mapPartitions():
     Return a new RDD by applying a function to each partition of this RDD.
     example :
      rdd = sc.parallelize([1, 2, 3, 4], 2)
      def f(iterator): yield sum(iterator)
      rdd.mapPartitions(f).collect()
        [3, 7]
 
    mapPartitionsWithIndex() :
    Return a new RDD by applying a function to each partition of this RDD,
        while tracking the index of the original partition.
    example :

       rdd = sc.parallelize([1, 2, 3, 4], 4)
       def f(splitIndex, iterator): yield splitIndex
       rdd.mapPartitionsWithIndex(f).sum()
        6
Q.4 What are transformations in spark ?
Ans .
Transformations are operations that build a new RDD (or DStream) from an existing one; they are evaluated lazily (see Q.6). Transformations on DStreams can be grouped into either stateless or stateful (a short sketch follows the list):
• In stateless transformations the processing of each batch does not depend on the
data of its previous batches. They include the common RDD transformations like
map(), filter(), and reduceByKey().
• Stateful transformations, in contrast, use data or intermediate results from previous
batches to compute the results of the current batch. They include transformations
based on sliding windows and on tracking state across time.
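
A minimal PySpark Streaming sketch to contrast the two kinds, assuming a socket text source on localhost:9999 (a made-up source):

     from pyspark.streaming import StreamingContext
     ssc = StreamingContext(sc, 10)                     # 10-second batches
     lines = ssc.socketTextStream("localhost", 9999)    # hypothetical input stream
     pairs = lines.flatMap(lambda l: l.split()).map(lambda w: (w, 1))
     perBatch = pairs.reduceByKey(lambda a, b: a + b)   # stateless: each batch is counted on its own
     windowed = pairs.reduceByKeyAndWindow(lambda a, b: a + b, None,
                                           30, 10)      # stateful: 30s window, sliding every 10s
     # ssc.start(); ssc.awaitTermination() would actually start the computation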

Q.5 What is schemaRDD in Spark ?

Ans. Spark SQL allows relational queries expressed in SQL, HiveQL, or Scala to be executed using Spark. At the core of this component is a new type of RDD, the SchemaRDD. SchemaRDDs are composed of Row objects, along with a schema that describes the data types of each column in the row. (From Spark 1.3 onwards the SchemaRDD has been renamed DataFrame.)
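
A minimal sketch with the SQLContext API (Spark 1.3+, where SchemaRDD became DataFrame); the table name people and the sample rows are made up:

     from pyspark.sql import SQLContext, Row
     sqlContext = SQLContext(sc)
     rows = sc.parallelize([Row(name="Alice", age=30), Row(name="Bob", age=25)])
     people = sqlContext.createDataFrame(rows)    # schema is inferred from the Row objects
     people.registerTempTable("people")           # expose it to SQL queries
     sqlContext.sql("SELECT name FROM people WHERE age > 26").collect()
     [Row(name='Alice')]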

Q.6 What is lazy transformation in spark ?

Ans
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently. For example, Spark can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
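
A minimal sketch of this behaviour (the file path data.txt is made up):

     lines = sc.textFile("data.txt")               # nothing is read yet, only the lineage is recorded
     lengths = lines.map(lambda s: len(s))         # still no computation
     lengths.persist()                             # mark the mapped RDD to be cached once computed
     total = lengths.reduce(lambda a, b: a + b)    # the action triggers reading, mapping and reducing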

Q.7 What  are the spark core components ?
Ans .

Figure : Spark master/slave architecture (Driver, Cluster Manager, Executors)
Spark uses a master/slave architecture. As you can see in the figure, it has one central coordinator (Driver) that communicates with many distributed workers (executors). The driver and each of the executors run in their own Java processes.

DRIVER
The driver is the process where the main method runs. First it converts the user program into tasks and after that it schedules the tasks on the executors.
EXECUTORS
Executors are worker nodes' processes in charge of running individual tasks in a given Spark job. They are launched at the beginning of a Spark application and typically run for the entire lifetime of an application. Once they have run the task they send the results to the driver. They also provide in-memory storage for RDDs that are cached by user programs through Block Manager.
APPLICATION EXECUTION FLOW
With this in mind, when you submit an application to the cluster with spark-submit, this is what happens internally (a minimal driver-side sketch follows the list):
  1. A standalone application starts and instantiates a SparkContext instance (and it is only then that you can call the application a driver).
  2. The driver program asks the cluster manager for resources to launch executors.
  3. The cluster manager launches executors.
  4. The driver process runs through the user application. Depending on the actions and transformations over RDDs, tasks are sent to the executors.
  5. Executors run the tasks and save the results.
  6. If any worker crashes, its tasks will be sent to different executors to be processed again. The book "Learning Spark: Lightning-Fast Big Data Analysis" discusses Spark and fault tolerance in more detail.
  7. With SparkContext.stop() from the driver, or if the main method exits/crashes, all the executors will be terminated and the cluster resources will be released by the cluster manager.
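
A minimal driver-side sketch in PySpark; the application name and local master are assumptions, and on a real cluster the master would normally come from spark-submit:

     from pyspark import SparkConf, SparkContext

     conf = SparkConf().setAppName("example-app").setMaster("local[2]")
     sc = SparkContext(conf=conf)    # step 1: this process becomes the driver
     rdd = sc.parallelize(range(10))
     print(rdd.sum())                # steps 4-5: the action ships tasks to the executors
     sc.stop()                       # step 7: executors terminate and resources are released
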
Q .8 Define Partitions?

Ans.

As the name suggests, a partition is a smaller, logical division of data, similar to a ‘split’ in MapReduce. Partitioning is the process of deriving logical units of data so that they can be processed in parallel, which speeds up processing. Every RDD in Spark is divided into partitions.
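
A minimal sketch of inspecting partitions in PySpark:

     rdd = sc.parallelize(range(100), 4)    # explicitly request 4 partitions
     rdd.getNumPartitions()
     4
     rdd.glom().map(len).collect()          # how many elements each partition holds
     [25, 25, 25, 25]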

Q .9 What is RDD Lineage ?

Ans .

Spark does not replicate data in memory, so if any data is lost, it is rebuilt using the RDD lineage. RDD lineage is the process of reconstructing lost data partitions. The best part is that an RDD always remembers how it was built from other datasets.

Q .10 What are benefits of Spark over MapReduce?

Ans.
  1. Due to in-memory processing, Spark performs processing around 10-100x faster than Hadoop MapReduce, which relies on persistent storage for every data processing task.
  2. Unlike Hadoop, Spark provides built-in libraries to perform multiple kinds of work on the same core engine: batch processing, streaming, machine learning and interactive SQL queries. Hadoop itself only supports batch processing.
  3. Hadoop is highly disk-dependent, whereas Spark promotes caching and in-memory data storage.
  4. Spark is capable of performing computations multiple times on the same dataset. This is called iterative computation, and there is no iterative computing implemented by Hadoop.

Q .11 What is a RDD Lineage Graph ?

Ans. 

RDD Lineage Graph (aka RDD operator graph) is a graph of all the parent RDDs of a RDD. It is built as a result of applying transformations to the RDD and creates a logical execution plan.
The following diagram uses cartesian or zip for learning purposes only. You may use other operators to build a RDD graph.

Figure : RDD Lineage

The above RDD graph could be the result of the following series of transformations:
      val r00 = sc.parallelize(0 to 9)                      // 0, 1, ..., 9
      val r01 = sc.parallelize(0 to 90 by 10)               // 0, 10, ..., 90
      val r10 = r00 cartesian r01                           // every (x, y) combination
      val r11 = r00.map(n => (n, n))                        // pair each element with itself
      val r12 = r00 zip r01                                 // element-wise pairs
      val r13 = r01.keyBy(_ / 20)                           // key each element by value / 20
      val r20 = Seq(r11, r12, r13).foldLeft(r10)(_ union _) // union of the four pair RDDs
    
An RDD lineage graph is hence a graph of which transformations need to be executed after an action has been called.
You can check the lineage information of an RDD using its toDebugString() method.
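
For example, in PySpark (a toy pipeline, just to produce a multi-step lineage):

     rdd = sc.parallelize(range(10)).map(lambda x: (x % 3, x)).reduceByKey(lambda a, b: a + b)
     print(rdd.toDebugString())    # prints the chain of parent RDDs as an indented tree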
