Wednesday 3 August 2016

Kafka Questions

1. Benchmarking Kafka:

Source: LinkedIn blogs


Q. What is the difference between the Kafka Receiver-based approach and the Direct approach?
Ans.

Receiver-based Approach:

This approach uses a Receiver to receive the data. The Receiver is implemented using the Kafka high-level consumer API. As with all receivers, the data received from Kafka through a Receiver is stored in Spark executors, and jobs launched by Spark Streaming then process the data.
However, under the default configuration, this approach can lose data under failures (see receiver reliability). To ensure zero data loss, you additionally have to enable Write Ahead Logs in Spark Streaming (introduced in Spark 1.2). This synchronously saves all the received Kafka data into write-ahead logs on a distributed file system (e.g. HDFS), so that all the data can be recovered on failure.
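
The sketch below is a minimal Scala example of the receiver-based approach, assuming the spark-streaming-kafka (Kafka 0.8) artifact, a ZooKeeper ensemble at localhost:2181, a consumer group "my-group" and the topic MyTopic; these names and the checkpoint path are illustrative assumptions, not values from the text above.

Example (Scala):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReceiverStreamExample {
  def main(args: Array[String]): Unit = {
    // For zero-data loss the write-ahead log must be enabled along with checkpointing.
    val conf = new SparkConf()
      .setAppName("ReceiverStreamExample")
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("hdfs:///checkpoints/receiver-example")   // checkpoint dir is an assumption

    // A receiver backed by the Kafka high-level consumer (coordinates via ZooKeeper);
    // the Map gives the number of receiver threads per topic.
    val stream = KafkaUtils.createStream(ssc, "localhost:2181", "my-group", Map("MyTopic" -> 1))

    stream.map(_._2).count().print()   // values only; record count per batch

    ssc.start()
    ssc.awaitTermination()
  }
}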

Direct Approach (No Receivers):

Instead of using receivers to receive data, this approach periodically queries Kafka for the latest offsets in each topic+partition, and accordingly defines the offset ranges to process in each batch.

  • Simplified Parallelism: No need to create multiple input Kafka streams and union them. With directStream, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume, which will all read data from Kafka in parallel. So there is a one-to-one mapping between Kafka and RDD partitions, which is easier to understand and tune.
  • Efficiency: Achieving zero-data loss in the first approach required the data to be stored in a Write Ahead Log, which further replicated the data. This is actually inefficient as the data effectively gets replicated twice - once by Kafka, and a second time by the Write Ahead Log. This second approach eliminates the problem as there is no receiver, and hence no need for Write Ahead Logs. As long as you have sufficient Kafka retention, messages can be recovered from Kafka.
  • Exactly-once semantics: The first approach uses Kafka’s high level API to store consumed offsets in Zookeeper. This is traditionally the way to consume data from Kafka. While this approach (in combination with write ahead logs) can ensure zero data loss (i.e. at-least once semantics), there is a small chance some records may get consumed twice under some failures. This occurs because of inconsistencies between data reliably received by Spark Streaming and offsets tracked by Zookeeper. Hence, in this second approach, we use the simple Kafka API that does not use Zookeeper. Offsets are tracked by Spark Streaming within its checkpoints.
Source: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
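
Below is a matching minimal sketch of the direct approach, again in Scala and assuming the spark-streaming-kafka (Kafka 0.8) artifact, a broker at localhost:9092 and the topic MyTopic (illustrative values).

Example (Scala):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DirectStreamExample")
    val ssc  = new StreamingContext(conf, Seconds(5))
    // Checkpointing lets Spark Streaming track the consumed offsets itself.
    ssc.checkpoint("hdfs:///checkpoints/direct-example")   // checkpoint dir is an assumption

    // No receiver: Spark queries Kafka for offset ranges each batch and creates
    // one RDD partition per Kafka partition of the topic.
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("MyTopic"))

    stream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}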

Q. In Kafka, how do messages get distributed to partitions?

Ans. The primary objective of partitions is to achieve parallelism: you can have as many consumers reading from the same topic as you have partitions.
When you create partitions, you also decide which messages should go to which partition. Either use a "key" when sending messages to the topic (messages with the same key land on the same partition), or write your own custom partitioning logic. For example, you could have 3 partitions and divide your messages into High, Medium and Low priority, then implement a partitioner such that High goes to partition 0, Medium goes to partition 1 and Low goes to partition 2, as in the sketch below.
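
A rough sketch of such a partitioner, assuming the Kafka 0.9+ producer API and that the record key carries the priority as a string (the class name and partition layout are illustrative assumptions).

Example (Scala):

import java.util.{Map => JMap}
import org.apache.kafka.clients.producer.Partitioner
import org.apache.kafka.common.Cluster

class PriorityPartitioner extends Partitioner {
  // Route each priority level to a fixed partition of a 3-partition topic.
  override def partition(topic: String, key: AnyRef, keyBytes: Array[Byte],
                         value: AnyRef, valueBytes: Array[Byte], cluster: Cluster): Int =
    String.valueOf(key).toLowerCase match {
      case "high"   => 0
      case "medium" => 1
      case _        => 2   // low (and anything unrecognised) goes to the last partition
    }

  override def configure(configs: JMap[String, _]): Unit = ()
  override def close(): Unit = ()
}

The producer is then pointed at it through the partitioner.class property, e.g. props.put("partitioner.class", classOf[PriorityPartitioner].getName).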

Q. How do you purge a Kafka topic?
Ans.
Temporarily update the retention time on the topic to one second:

Command:

kafka-topics.sh --zookeeper localhost:13003 --alter --topic MyTopic --config retention.ms=1000

Then wait for the purge to take effect (about one minute). Once purged, restore the previous retention.ms value.

Q. What is the role of Zookeeper in Kafka?
Ans. Kafka is built to use Zookeeper.
Kafka is a distributed system and uses Zookeeper to track the status of Kafka cluster nodes. It also keeps track of Kafka topics, partitions, etc.
Zookeeper is basically used to maintain coordination between the different nodes in a cluster. One of the most important things for Kafka is that it uses Zookeeper to periodically commit offsets, so that in case of a node failure it can resume from the previously committed offset (imagine taking care of all this on your own). Zookeeper also plays a vital role in many other areas, such as leader detection, configuration management, synchronization, and detecting when a node joins or leaves the cluster.
Now, how does Kafka use ZooKeeper?
As of v0.8, Kafka uses ZooKeeper to store a variety of configurations as key/value pairs in the ZK data tree and uses them across the cluster in a distributed fashion. Let’s take two simple use cases for which Kafka maintains values in ZooKeeper:

1. Topics under a broker - /brokers/topics/[topic]

2. Next offset for a consumer/topic/partition combination - /consumers/[groupId]/offsets/[topic]/[partitionId]

Now think about the “distributed-ness” of this: configurations like these are of course replicated and distributed throughout the ZooKeeper ensemble (leader node and follower nodes). So I don’t think Kafka will work without ZooKeeper (at least for pre-0.8.2 versions of Kafka).
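
To make the paths above concrete, here is a small Scala sketch that lists the topic znodes with the plain ZooKeeper client; the ensemble address localhost:2181 is an assumption.

Example (Scala):

import org.apache.zookeeper.ZooKeeper
import scala.collection.JavaConverters._

object ZkPathsExample {
  def main(args: Array[String]): Unit = {
    // Connect to the same ensemble the brokers use; no watcher needed for a one-off read.
    val zk = new ZooKeeper("localhost:2181", 3000, null)

    // Brokers register every topic under /brokers/topics.
    zk.getChildren("/brokers/topics", false).asScala.foreach(topic => println(s"topic: $topic"))

    zk.close()
  }
}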

Kafka uses Zookeeper for the following:


  1. Electing a controller. The controller is one of the brokers and is responsible for maintaining the leader/follower relationship for all the partitions. When a node shuts down, it is the controller that tells other replicas to become partition leaders to replace the partition leaders on the node that is going away. Zookeeper is used to elect a controller, make sure there is only one, and elect a new one if it crashes.
  2. Cluster membership - which brokers are alive and part of the cluster? This is also managed through ZooKeeper.
  3. Topic configuration - which topics exist, how many partitions each has, where the replicas are, who the preferred leader is, and what configuration overrides are set for each topic.
  4. (0.9.0) Quotas - how much data each client is allowed to read and write.
  5. (0.9.0) ACLs - who is allowed to read from and write to which topic.
  6. (old high-level consumer) Which consumer groups exist, who their members are, and the latest offset each group has consumed from each partition.
  7. In short, Zookeeper tracks the status of the nodes present in the Kafka cluster and also keeps track of Kafka topics, partitions, etc.

Q.2 How does the producer identify the leader in Kafka?
Ans. The producer sends a Metadata request with a list of topics to one of the brokers in the broker list you supplied when configuring the producer.
The broker responds with a list of partitions in those topics and the leader for each partition. The producer caches this information and knows where to redirect its produce messages. 
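
In other words, the producer only needs a bootstrap list of brokers and discovers the partition leaders by itself. A minimal sketch follows (Kafka 0.9+ Java client used from Scala; the broker address and topic name are assumptions).

Example (Scala):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ProducerExample {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // Just a starting point for the Metadata request; not necessarily the leader.
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // The client routes this record to the cached leader of the chosen partition.
    producer.send(new ProducerRecord[String, String]("MyTopic", "key-1", "hello"))
    producer.close()
  }
}
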
Q.3 Is Kafka good for a scenario involving hundreds of thousands of topics (where tens of topics, or a topic with tens of partitions, represent a data flow for a customer)?
Ans. Technically, Kafka will work nicely with 1000-3000 partitions per broker, so with enough brokers your plan will work.
However, in my experience, whenever people come up with a plan that ends up with 100K topics or more, they are not designing their data model correctly. This is especially true for people who used JMS queues before and are applying the same type of design to Kafka.
Q.4  What's the point of partitions in Kafka? How do I know how many partitions are best for a data set of size X?
Q.5 What happens when the number of consumers is greater than the number of partitions, and vice versa?
Ans.

 Consumers read from any single partition, allowing you to scale throughput of message consumption in a similar fashion to message production. Consumers can also be organized into consumer groups for a given topic — each consumer within the group reads from a unique partition and the group as a whole consumes all messages from the entire topic.

 If you have more consumers than partitions then some consumers will be idle because they have no partitions to read from. 

If you have more partitions than consumers, then some consumers will receive messages from multiple partitions.

If you have equal numbers of consumers and partitions, each consumer reads messages in order from exactly one partition.
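
The sketch below shows consumers sharing one group: run one instance of this class per desired consumer (Kafka 0.9+ consumer API; the broker address, group id and topic name are assumptions). The brokers split the topic's partitions among the running instances, and any instance beyond the partition count sits idle.

Example (Scala):

import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

object GroupConsumerExample {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "my-consumer-group")   // same group.id => partitions are shared
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("MyTopic"))

    while (true) {
      // Each instance only sees records from the partitions assigned to it.
      for (record <- consumer.poll(1000).asScala)
        println(s"partition=${record.partition()} offset=${record.offset()} value=${record.value()}")
    }
  }
}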

Q.6 How does Kafka guarantee consistency and availability?
Ans. Before beginning the discussion on consistency and availability, keep in mind that these guarantees hold as long as you are producing to one partition and consuming from one partition. All guarantees are off if you are reading from the same partition using two consumers or writing to the same partition using two producers.
Kafka makes the following guarantees about data consistency and availability: (1) Messages sent to a topic partition will be appended to the commit log in the order they are sent, (2) a single consumer instance will see messages in the order they appear in the log, (3) a message is ‘committed’ when all in sync replicas have applied it to their log, and (4) any committed message will not be lost, as long as at least one in sync replica is alive.
The first and second guarantees ensure that message ordering is preserved for each partition. Note that message ordering for the entire topic is not guaranteed. The third and fourth guarantees ensure that committed messages can be retrieved. In Kafka, the replica that is elected the leader of a partition is responsible for syncing any messages received to the other replicas. Once a replica has acknowledged the message, that replica is considered to be in sync. To understand this further, let’s take a closer look at what happens during a write.
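
The sketch below illustrates how a producer waits for a committed write: with acks set to "all", the leader acknowledges only after all in-sync replicas have applied the message. The broker address, topic, and the related min.insync.replicas topic setting are assumptions, not values from the text above.

Example (Scala):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object CommittedWriteExample {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("acks", "all")   // wait until all in-sync replicas have the message
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // get() blocks until the broker confirms the write is committed (or fails).
    producer.send(new ProducerRecord[String, String]("MyTopic", "key-1", "durable message")).get()
    producer.close()
  }
}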
