Wednesday 3 August 2016

Kafka Questions

Q.1 Benchmarking Kafka:

Source: LinkedIn Engineering blogs:
https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

Q.2 What is the difference between the Kafka receiver-based approach and the direct approach in Spark Streaming?
Ans.


Receiver-based Approach:

This approach uses a Receiver to receive the data. The Receiver is implemented using the Kafka high-level consumer API. As with all receivers, the data received from Kafka through a Receiver is stored in Spark executors, and then the jobs launched by Spark Streaming process the data.
However, under the default configuration, this approach can lose data under failures (see receiver reliability in the Spark Streaming guide). To ensure zero data loss, you additionally have to enable Write Ahead Logs in Spark Streaming (introduced in Spark 1.2). This synchronously saves all the received Kafka data into write ahead logs on a distributed file system (e.g. HDFS), so that all the data can be recovered on failure.
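A minimal sketch of the receiver-based approach in Scala, following the Spark Streaming Kafka 0.8 integration; the ZooKeeper address, group id, topic name and batch interval are placeholder assumptions:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("ReceiverBasedExample")
val ssc = new StreamingContext(conf, Seconds(10))

// The receiver uses the Kafka high-level consumer API; offsets are tracked in ZooKeeper.
// The map is topic -> number of receiver threads.
// For zero data loss, also set spark.streaming.receiver.writeAheadLog.enable=true.
val kafkaStream = KafkaUtils.createStream(
  ssc,
  "zkhost:2181",         // ZooKeeper quorum (placeholder)
  "my-consumer-group",   // consumer group id (placeholder)
  Map("MyTopic" -> 1))

kafkaStream.map(_._2).count().print()

ssc.start()
ssc.awaitTermination()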

Direct Approach (No Receivers):

Instead of using receivers to receive data, this approach periodically queries Kafka for the latest offsets in each topic+partition, and accordingly defines the offset ranges to process in each batch.
  • Simplified Parallelism: No need to create multiple input Kafka streams and union them. With directStream, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume, which will all read data from Kafka in parallel. So there is a one-to-one mapping between Kafka and RDD partitions, which is easier to understand and tune.
  • Efficiency: Achieving zero-data loss in the first approach required the data to be stored in a Write Ahead Log, which further replicated the data. This is actually inefficient as the data effectively gets replicated twice - once by Kafka, and a second time by the Write Ahead Log. This second approach eliminates the problem as there is no receiver, and hence no need for Write Ahead Logs. As long as you have sufficient Kafka retention, messages can be recovered from Kafka.
  • Exactly-once semantics: The first approach uses Kafka’s high level API to store consumed offsets in Zookeeper. This is traditionally the way to consume data from Kafka. While this approach (in combination with write ahead logs) can ensure zero data loss (i.e. at-least once semantics), there is a small chance some records may get consumed twice under some failures. This occurs because of inconsistencies between data reliably received by Spark Streaming and offsets tracked by Zookeeper. Hence, in this second approach, we use simple Kafka API that does not use Zookeeper. Offsets are tracked by Spark Streaming within its checkpoints.
Source: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
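For comparison, a minimal sketch of the direct approach in Scala (Kafka 0.8 direct API from the integration guide above); the broker list, topic, checkpoint path and batch interval are placeholder assumptions:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("DirectApproachExample")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("hdfs:///tmp/checkpoints")   // offsets are tracked in Spark Streaming checkpoints (path is a placeholder)

// No receiver: Spark queries Kafka for offsets and creates one RDD partition per Kafka partition.
val kafkaParams = Map[String, String]("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("MyTopic")

val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

directStream.map(_._2).count().print()

ssc.start()
ssc.awaitTermination()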

Q.3 In Kafka, how do messages get distributed across partitions?
Ans. The primary objective of partitions is to achieve parallelism: you can have as many consumers reading from the same topic as there are partitions.
When you create partitions, you decide which partition each message goes to. Either send messages to the topic with a "key" (the default partitioner hashes the key to pick a partition) or write your own custom partitioner logic. For example, with 3 partitions you could divide your messages into High, Medium and Low priority, and implement a partitioner such that High goes to partition 0, Medium goes to partition 1 and Low goes to partition 2 (see the sketch below).
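A minimal sketch of such a priority partitioner in Scala against the Kafka 0.9+ producer API; the class name, topic, keys and partition assignments are illustrative assumptions, not a fixed convention:

import java.util.{Map => JMap, Properties}
import org.apache.kafka.clients.producer.{KafkaProducer, Partitioner, ProducerRecord}
import org.apache.kafka.common.Cluster

// Routes messages by priority key: "high" -> partition 0, "medium" -> 1, anything else -> 2.
// Assumes the target topic has at least 3 partitions.
class PriorityPartitioner extends Partitioner {
  override def partition(topic: String, key: Any, keyBytes: Array[Byte],
                         value: Any, valueBytes: Array[Byte], cluster: Cluster): Int =
    key match {
      case "high"   => 0
      case "medium" => 1
      case _        => 2
    }
  override def close(): Unit = {}
  override def configure(configs: JMap[String, _]): Unit = {}
}

object PriorityProducer extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "broker1:9092")   // placeholder broker
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("partitioner.class", "PriorityPartitioner")   // fully qualified name of the class above

  val producer = new KafkaProducer[String, String](props)
  // The key ("high") drives the partition choice via the custom partitioner.
  producer.send(new ProducerRecord[String, String]("MyTopic", "high", "urgent payload"))
  producer.close()
}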

Q.4 How do you purge a Kafka topic?
Ans.
Temporarily update the retention time on the topic to one second:

Command:
kafka-topics.sh --zookeeper localhost:13003 --alter --topic MyTopic --config retention.ms=1000


Then wait for the purge to take effect (about one minute). Once purged, restore the previous retention.ms value.
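Restoring the retention uses the same command; the value below assumes the topic was on the common 7-day default (604800000 ms), so substitute whatever retention.ms the topic actually had before:

Command:
kafka-topics.sh --zookeeper localhost:13003 --alter --topic MyTopic --config retention.ms=604800000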
