Sunday 20 September 2015

Setting up PySpark in Eclipse

Install the PyDev plugin in Eclipse.

Then navigate to

Project -> Properties -> PyDev - PYTHONPATH -> External Libraries

and add these source folders:

/path/to/spark/spark-0.9.1/python
/path/to/pyspark-src.zip

Set SPARK_HOME in Eclipse

At the beginning of your program, add this:

import os

os.environ["SPARK_HOME"]="/usr/hdp/spark"

Use the root directory of your Spark installation.
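Putting the pieces together, a minimal bootstrap script looks like this (the paths are the examples used above; adjust them to your installation):

```python
import os
import sys

# Point SPARK_HOME at the root of your Spark installation
# (example path from this post; adjust to your environment)
os.environ["SPARK_HOME"] = "/usr/hdp/spark"

# Make the pyspark sources importable, mirroring the PyDev
# External Libraries entries configured above
spark_home = os.environ["SPARK_HOME"]
sys.path.insert(0, os.path.join(spark_home, "python"))

# With the paths in place, pyspark can be imported as usual:
# from pyspark import SparkContext
# sc = SparkContext("local", "MyApp")
```

The imports are left as comments so the script also runs on machines where Spark is not yet installed.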

Multiple Output Formats in Hadoop

The MultipleOutputs class simplifies writing data to multiple outputs.

  • Configure a named output with a name and an OutputFormat.
  • When writing out a (key, value) pair, specify the named output to send it to.


This is accomplished by assigning a name to each output using the static addNamedOutput method of MultipleOutputs.

For example, in the driver class add the following:


MultipleOutputs.addNamedOutput(job, "QuantityData", TextOutputFormat.class, NullWritable.class, Text.class);
MultipleOutputs.addNamedOutput(job, "loadProfileChannelData", TextOutputFormat.class, NullWritable.class, Text.class);
MultipleOutputs.addNamedOutput(job, "registerValueData", TextOutputFormat.class, NullWritable.class, Text.class);

In the mapper class, add the following:

public class MyParserMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    private MultipleOutputs<NullWritable, Text> outs;

    @Override
    public void setup(Context context) throws IOException, InterruptedException {
        outs = new MultipleOutputs<NullWritable, Text>(context);
    }

    // Inside map(), write to a named output instead of the default one:
    // outs.write("QuantityData", NullWritable.get(), value);

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        outs.close(); // flush and close all named outputs
    }
}

Twitter Kafka Integration

Add these Maven dependencies to your project:


<dependency>
    <groupId>com.twitter</groupId>
    <artifactId>hbc-core</artifactId>
    <version>2.2.0</version>
</dependency>

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.9.2</artifactId>
    <version>0.8.2.0</version>
</dependency>

Create a topic for the producer to send tweets to:


kafka-topics.sh  --create  --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic twitter_topic

The producer will send tweets to this topic:
private static final String topic = "twitter_topic";

Properties properties = new Properties();
properties.put("metadata.broker.list", "localhost:6667");
properties.put("serializer.class", "kafka.serializer.StringEncoder");

ProducerConfig producerConfig = new ProducerConfig(properties);
kafka.javaapi.producer.Producer<String, String> producer =
        new kafka.javaapi.producer.Producer<String, String>(producerConfig);

BlockingQueue<String> queue = new LinkedBlockingQueue<String>(10000);
StatusesFilterEndpoint endpoint = new StatusesFilterEndpoint();

endpoint.trackTerms(Lists.newArrayList("Twitter-API", "#TOI")); // put search keywords here

// Twitter authentication (get the key and token values from https://dev.twitter.com/)
Authentication auth = new OAuth1(consumerKey, consumerSecret, token, secret);

Client client = new ClientBuilder()
        .hosts(Constants.STREAM_HOST)
        .endpoint(endpoint)
        .authentication(auth)
        .processor(new StringDelimitedProcessor(queue))
        .build();

// establish the connection to the Twitter stream
client.connect();

for (int msgRead = 0; msgRead < 1000; msgRead++) {
    KeyedMessage<String, String> message = null;
    try {
        message = new KeyedMessage<String, String>(topic, queue.take());
        System.out.println("message : \n" + message);
    } catch (InterruptedException e) {
        e.printStackTrace();
        continue; // skip sending when no message was taken
    }
    producer.send(message);
}
producer.close();
client.stop();


Run this program as a Java application.

It will act as a Twitter producer for Kafka.


To consume tweets from the topic, run:

kafka-console-consumer.sh --zookeeper localhost:2181 --topic twitter_topic --from-beginning