Sending events to Kafka from spark job

2016-03-29 Thread fanooos
I am trying to read tweets stored in a file and send them to Kafka from a PySpark job.

The code is very simple, but it does not work.

The Spark job runs correctly, but nothing is sent to Kafka.

Here is the code:

#!/usr/bin/python
# -*- coding: utf-8 -*-

from pyspark import SparkContext, SparkConf
from kafka import KafkaProducer
import sys
reload(sys)
sys.setdefaultencoding('utf8')

topic = "testTopic"

if __name__ == "__main__":

    conf = SparkConf().setAppName("TestSparkFromPython")

    sc = SparkContext(conf=conf)

    producer = KafkaProducer(bootstrap_servers="10.62.54.111:9092")

    tweets = sc.textFile("/home/fanooos/Desktop/historical_scripts/output/1/activities_201603270430_201603270440.json")

    tweetsCollection = tweets.collect()

    for tweet in tweetsCollection:
        producer.send(topic, value=bytes(tweet))






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Sending-events-to-Kafka-from-spark-job-tp26622.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Sending events to Kafka from spark job

2016-03-29 Thread fanooos
I think I found a solution, but I have no idea why it affects the execution
of the application.

At the end of the script I added a sleep statement:

import time
time.sleep(1)


This solved the problem.
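A plausible explanation (not confirmed in the thread): kafka-python's KafkaProducer.send() is asynchronous. It only enqueues the record, and a background I/O thread delivers it later, so a script that exits immediately can drop everything still buffered. The sleep gives that thread time to drain the queue; calling producer.flush() before exit is the deterministic way to get the same effect. The ToyAsyncProducer class below is a hypothetical stand-in that only illustrates the buffering behavior; no Kafka broker is needed to run it.

```python
import queue
import threading
import time

class ToyAsyncProducer:
    """Stand-in for kafka-python's KafkaProducer: send() only enqueues;
    a background daemon thread does the actual delivery later."""
    def __init__(self):
        self._queue = queue.Queue()
        self.delivered = []
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def send(self, topic, value):
        self._queue.put((topic, value))  # returns immediately, nothing sent yet

    def _run(self):
        while True:
            topic, value = self._queue.get()
            time.sleep(0.01)  # simulate per-record network latency
            self.delivered.append((topic, value))
            self._queue.task_done()

    def flush(self):
        self._queue.join()  # block until every queued record is delivered

producer = ToyAsyncProducer()
for i in range(5):
    producer.send("testTopic", str(i).encode())
producer.flush()  # without this, the process could exit with records still queued
print(len(producer.delivered))  # prints 5
```

With the real producer, replacing the sleep with producer.flush() (and optionally producer.close()) should make delivery independent of timing.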



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Sending-events-to-Kafka-from-spark-job-tp26622p26624.html



Spark Streaming Job get killed after running for about 1 hour

2016-04-24 Thread fanooos
I have a Spark streaming job that reads a tweet stream from Gnip and writes it
to Kafka.

Spark and Kafka are running on the same cluster.

My cluster consists of 5 nodes: kafka-b01 ... kafka-b05.

The Spark master is running on kafka-b05.

Here is how we submit the Spark job:

nohup sh $SPARK_HOME/bin/spark-submit --total-executor-cores 5 --class
org.css.java.gnipStreaming.GnipSparkStreamer --master spark://kafka-b05:7077
GnipStreamContainer.jar powertrack
kafka-b01.css.org,kafka-b02.css.org,kafka-b03.css.org,kafka-b04.css.org,kafka-b05.css.org
gnip_live_stream 2 &

After about 1 hour the Spark job gets killed.

The logs in the nohup file show the following exception:

org.apache.spark.storage.BlockFetchException: Failed to fetch block from 2
locations. Most recent failure cause:
at
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:595)
at
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:585)
at 
org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:570)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:630)
at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:48)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: io.netty.channel.ChannelException: Unable to create Channel from
class class io.netty.channel.socket.nio.NioSocketChannel
at
io.netty.bootstrap.AbstractBootstrap$BootstrapChannelFactory.newChannel(AbstractBootstrap.java:455)
at
io.netty.bootstrap.AbstractBootstrap.initAndRegister(AbstractBootstrap.java:306)
at io.netty.bootstrap.Bootstrap.doConnect(Bootstrap.java:134)
at io.netty.bootstrap.Bootstrap.connect(Bootstrap.java:116)
at
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:211)
at
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
at
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
at
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at
org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
at
org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:99)
at
org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:89)
at
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:588)
... 15 more
Caused by: io.netty.channel.ChannelException: Failed to open a socket.
at
io.netty.channel.socket.nio.NioSocketChannel.newSocket(NioSocketChannel.java:62)
at
io.netty.channel.socket.nio.NioSocketChannel.<init>(NioSocketChannel.java:72)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.lang.Class.newInstance(Class.java:442)
at
io.netty.bootstrap.AbstractBootstrap$BootstrapChannelFactory.newChannel(AbstractBootstrap.java:453)
... 26 more
Caused by: java.net.SocketException: Too many open files
at sun.nio.ch.Net.socket0(Native Method)
at sun.nio.ch.Net.socket(Net.java:411)
at sun.nio.ch.Net.socket(Net.java:404)
at sun.nio.ch.SocketChannelImpl.<init>(SocketChannelImpl.java:105)
at
sun.nio.ch.SelectorProviderImpl.openSocketChannel(SelectorProviderImpl.java:60)
at
io.netty.channel.socket.nio.NioSocketChannel.newSocket(NioSocketChannel.java:60)
... 33 more


I have increased the maximum number of open files, but I am still facing the
same issue.

When I checked the stderr logs of the workers from the Spark web interface, I
found another exception:

java.nio.channels.ClosedChannelException
at kafka.network.BlockingChannel.send
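Two things may be worth checking here (assumptions, not confirmed by the thread): the raised ulimit must apply to the user and session that actually launch the Spark workers and executors, since daemons started under the old limit keep it; and "Too many open files" can also come from leaked sockets (for example a Kafka connection opened per batch and never closed), in which case raising the limit only delays the failure. A small sketch for inspecting and raising the limit a running process actually has:

```python
import resource

# Limits as seen by *this* process; a higher ulimit set in another shell
# does not apply to Spark daemons that were started under the old limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft=%s hard=%s" % (soft, hard))

# The soft limit can be raised up to the hard limit without root; going
# beyond the hard limit needs limits.conf / systemd changes on every node.
if hard != resource.RLIM_INFINITY:
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```

Running a check like this inside the driver and executors shows whether the new limit actually reached them.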

Can not import KafkaProducer in spark streaming job

2016-05-01 Thread fanooos
I have a very strange problem. 

I wrote a Spark streaming job that monitors an HDFS directory, reads the newly
added files, and sends their contents to Kafka.

The job is written in Python, and you can get the code from this link:

http://pastebin.com/mpKkMkph

When submitting the job I get this error:

*ImportError: cannot import name KafkaProducer*

As you can see, the error is very simple, but the problem is that I can import
KafkaProducer from both the python and pyspark shells without any problem.

I tried rebooting the machine, but the situation remains the same.

What do you think the problem is?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Can-not-import-KafkaProducer-in-spark-streaming-job-tp26857.html



Re: Can not import KafkaProducer in spark streaming job

2016-05-02 Thread fanooos
I managed to solve the issue, but the solution is very weird.

I ran the command cat old_script.py > new_script.py and then submitted the
job using the new script.

This is the second time I have faced such an issue with a Python script, and I
have no explanation for what happened.

I hope this trick helps someone else who faces a similar situation.
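One hedged guess at what the cat trick actually changed: cat copies bytes verbatim, but the job was then submitted under a new file name, so anything keyed to the old name, such as a stale compiled .pyc sitting next to the script or a same-named module shadowing the kafka package on the search path, would stop mattering. A quick way to check which file Python actually resolves a module name to (shown here with a stdlib module; for the real case, resolve "kafka"):

```python
import importlib.util

def resolve(module_name):
    """Return the file Python would import for this name, or None."""
    spec = importlib.util.find_spec(module_name)
    return spec.origin if spec else None

# If this prints a path outside site-packages (e.g. a local kafka.py or a
# stray stale kafka.pyc), that shadowing module is what breaks the import.
print(resolve("json"))
```

Comparing the resolved path on the machine running spark-submit against the path seen from the plain python shell would show whether the two environments were importing different files.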




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Can-not-import-KafkaProducer-in-spark-streaming-job-tp26857p26859.html



Apache Spark data locality when integrating with Kafka

2016-02-06 Thread fanooos
Dears

If I use Kafka as a streaming source for some Spark jobs, is it advisable
to install Spark on the same nodes as the Kafka cluster?

What are the benefits and drawbacks of such a decision? 

regards 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-data-locality-when-integrating-with-Kafka-tp26165.html



java.lang.ClassNotFoundException: org.apache.spark.streaming.twitter.TwitterReceiver

2015-11-08 Thread fanooos
This is my first Spark Streaming application. The setup is as follows:

3 nodes running a Spark cluster: one master node and two slaves.

The application is a simple Java application that streams from Twitter, with
dependencies managed by Maven.

Here is the code of the application:

public class SimpleApp {

    public static void main(String[] args) {

        SparkConf conf = new SparkConf().setAppName("Simple Application")
                .setMaster("spark://rethink-node01:7077");

        JavaStreamingContext sc = new JavaStreamingContext(conf, new Duration(1000));

        ConfigurationBuilder cb = new ConfigurationBuilder();

        cb.setDebugEnabled(true).setOAuthConsumerKey("ConsumerKey")
                .setOAuthConsumerSecret("ConsumerSecret")
                .setOAuthAccessToken("AccessToken")
                .setOAuthAccessTokenSecret("TokenSecret");

        OAuthAuthorization auth = new OAuthAuthorization(cb.build());

        JavaDStream<Status> tweets = TwitterUtils.createStream(sc, auth);

        JavaDStream<String> statuses = tweets.map(new Function<Status, String>() {
            public String call(Status status) throws Exception {
                return status.getText();
            }
        });

        statuses.print();

        sc.start();

        sc.awaitTermination();

    }

}


Here is the pom file:

<project xmlns="http://maven.apache.org/POM/4.0.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
    http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>SparkFirstTry</groupId>
    <artifactId>SparkFirstTry</artifactId>
    <version>0.0.1-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>1.5.1</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.10</artifactId>
            <version>1.5.1</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.twitter4j</groupId>
            <artifactId>twitter4j-stream</artifactId>
            <version>3.0.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-twitter_2.10</artifactId>
            <version>1.0.0</version>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>src</sourceDirectory>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.3</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass>com.test.sparkTest.SimpleApp</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

The application starts successfully, but no tweets come and this exception
is thrown:

15/11/08 15:55:46 WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 78,
192.168.122.39): java.io.IOException: java.lang.ClassNotFoundException:
org.apache.spark.streaming.twitter.TwitterReceiver
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1163)
at
org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:70)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException:
org.apache.spark.streaming.twitter.TwitterReceiver
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at
java.io.Object

Spark Job Failed with FileNotFoundException

2016-11-01 Thread fanooos
I have a Spark cluster consisting of 5 nodes and a Spark job that should
process some files from a directory and send their contents to Kafka.

I am trying to submit the job using the following command:

bin$ ./spark-submit --total-executor-cores 20 --executor-memory 5G --class
org.css.java.FileMigration.FileSparkMigrator --master
spark://spark-master:7077
/home/me/FileMigrator-0.1.1-jar-with-dependencies.jar /home/me/shared
kafka01,kafka02,kafka03,kafka04,kafka05 kafka_topic


The directory /home/me/shared is mounted on all 5 nodes, but when I
submit the job I get the following exception:

java.io.FileNotFoundException: File file:/home/me/shared/input_1.txt does
not exist
at
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
at
org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
at
org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
at
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)



After some tries, I faced another weird behavior. When I submit the job
while the directory contains 1 or 2 files, the same exception is thrown on
the driver machine, but the job continues and the files are processed
successfully. Once I add another file, the exception is thrown and the job
fails.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Job-Failed-with-FileNotFoundException-tp27980.html



InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag

2015-02-23 Thread fanooos
Hi 

I have installed Hadoop on a local virtual machine using the steps from this
URL:

https://www.digitalocean.com/community/tutorials/how-to-install-hadoop-on-ubuntu-13-10

On the local machine I wrote a little Spark application in Java to read a
file from the Hadoop instance installed in the virtual machine.

The code is below:

public static void main(String[] args) {

    JavaSparkContext sc = new JavaSparkContext(new SparkConf()
            .setAppName("Spark Count").setMaster("local"));

    JavaRDD<String> lines = sc.textFile("hdfs://10.62.57.141:50070/tmp/lines.txt");
    JavaRDD<Integer> lengths = lines.flatMap(new FlatMapFunction<String, Integer>() {
        @Override
        public Iterable<Integer> call(String t) throws Exception {
            return Arrays.asList(t.length());
        }
    });
    List<Integer> collect = lengths.collect();
    int totalLength = lengths.reduce(new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer v1, Integer v2) throws Exception {
            return v1 + v2;
        }
    });
    System.out.println(totalLength);

}


The application throws this exception:

Exception in thread "main" java.io.IOException: Failed on local
exception: com.google.protobuf.InvalidProtocolBufferException: Protocol
message end-group tag did not match expected tag.; Host Details : local host
is: "TOSHIBA-PC/192.168.56.1"; destination host is: "10.62.57.141":50070; 
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
at org.apache.hadoop.ipc.Client.call(Client.java:1351)
at org.apache.hadoop.ipc.Client.call(Client.java:1300)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy12.getFileInfo(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy12.getFileInfo(Unknown Source)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:651)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1679)
at
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1106)
at
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1102)
at
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1102)
at 
org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1701)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1647)
at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:222)
at
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
at 
org.apache.spark.rdd.FlatMappedRDD.getPartitions(FlatMappedRDD.scala:30)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1367)
at org.apache.spark.rdd.RDD.collect(RDD.scala:797)
at
org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:309)
at org.apache.spark.api.java.JavaRDD.collect(JavaRDD.scala:32)
at org.css.RaiSpark.RaiSparkApp.main(RaiSparkApp.java:25)
Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol
message end-group tag did not match expected tag.
at
com.google.protobuf.InvalidProtocolBufferException.invalidEndTag(InvalidProtocolBufferException.java:94)
at
com.google.protobuf.CodedInputStre

Re: InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag

2015-02-23 Thread fanooos
I have resolved this issue. Actually there were two problems.

The first problem in the application was the HDFS port. It was configured
(in core-site.xml) as 9000, but in the application I was using 50070,
which (I think) is the default port.

The second problem: I forgot to put the file into HDFS :( .
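In other words, sc.textFile() needs the namenode's RPC address, which comes from fs.defaultFS (or fs.default.name on older Hadoop) in core-site.xml, here port 9000, while 50070 is the namenode's web UI port. A small sketch of pulling that value out of core-site.xml; the XML fragment below is a hypothetical sample mirroring the setup described in the post:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical core-site.xml fragment matching the resolution above.
sample = io.StringIO("""
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://10.62.57.141:9000</value>
  </property>
</configuration>
""")

root = ET.parse(sample).getroot()
# Collect all <property> name/value pairs into a dict.
props = {p.findtext("name"): p.findtext("value") for p in root.iter("property")}
print(props["fs.defaultFS"])  # this URI, plus the path, is what textFile() should get
```

The corrected call would then use that URI, e.g. sc.textFile("hdfs://10.62.57.141:9000/tmp/lines.txt").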





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/InvalidProtocolBufferException-Protocol-message-end-group-tag-did-not-match-expected-tag-tp21777p21781.html



Spark SQL Thrift Server start exception : java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory

2015-03-02 Thread fanooos
I have installed a Hadoop cluster (version 2.6.0), Apache Spark (version
1.2.1, pre-built for Hadoop 2.4 and later), and Hive (version 1.0.0).

When I try to start the Spark SQL Thrift Server I get the following
exception:

Exception in thread "main" java.lang.RuntimeException:
java.lang.RuntimeException: Unable to instantiate
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
at
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
at
org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.scala:235)
at
org.apache.spark.sql.hive.HiveContext$$anonfun$4.apply(HiveContext.scala:231)
at scala.Option.orElse(Option.scala:257)
at
org.apache.spark.sql.hive.HiveContext.x$3$lzycompute(HiveContext.scala:231)
at org.apache.spark.sql.hive.HiveContext.x$3(HiveContext.scala:229)
at
org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:229)
at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:229)
at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:292)
at 
org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276)
at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:248)
at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:91)
at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:90)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:90)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:72)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:51)
at
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:56)
at
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.RuntimeException: Unable to instantiate
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
at
org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1412)
at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:62)
at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:72)
at
org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2453)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2465)
at
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:340)
... 26 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at
org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1410)
... 31 more
Caused by: javax.jdo.JDOFatalUserException: Class
org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
NestedThrowables:
java.lang.ClassNotFoundException:
org.datanucleus.api.jdo.JDOPersistenceManagerFactory
at
javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1175)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
at
org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:310)
at
org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:339)
at
org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:248)
at
org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:223)
at 
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at
org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:58)
at
org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStor

Connecting a PHP/Java applications to Spark SQL Thrift Server

2015-03-03 Thread fanooos
We have installed a Hadoop cluster with Hive and Spark, and the Spark SQL
Thrift Server is up and running without any problem.

Now we have a set of applications that need to use the Spark SQL Thrift
Server to query some data.

Some of these applications are Java applications and the others are PHP
applications.

As an old-fashioned Java developer, I used to connect Java applications
to DB servers like MySQL using a JDBC driver. Is there a corresponding
driver for connecting to the Spark SQL Thrift Server, or what library
do I need to use to connect to it?


For PHP, what ways can we use to connect PHP applications to the Spark
SQL Thrift Server?





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Connecting-a-PHP-Java-applications-to-Spark-SQL-Thrift-Server-tp21902.html



Connection PHP application to Spark Sql thrift server

2015-03-05 Thread fanooos
We have two applications that need to connect to the Spark SQL Thrift Server.

The first application is developed in Java. With the Spark SQL Thrift Server
running, we followed the steps in this link and the application connected
smoothly without any problem.


The second application is developed in PHP. We followed the steps provided
in this link. When hitting the PHP script, the Spark SQL Thrift Server
throws this exception:

15/03/05 11:53:19 ERROR TThreadPoolServer: Error occurred during processing
of message.
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException
at
org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
at
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:189)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.thrift.transport.TTransportException
at
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at
org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:182)
at
org.apache.thrift.transport.TSaslServerTransport.handleSaslStartMessage(TSaslServerTransport.java:125)
at 
org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:253)
at
org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
at
org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
... 4 more


I searched a lot about this exception, but I cannot figure out what the
problem is.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Connection-PHP-application-to-Spark-Sql-thrift-server-tp21925.html



Is there any problem in having a long opened connection to spark sql thrift server

2015-03-09 Thread fanooos
I have some applications developed using PHP, and currently we have a problem
connecting these applications to the Spark SQL Thrift Server. (Here is the
problem I am talking about.)


Until I find a solution to this problem, there is a suggestion to make a
little Java application that connects to the Spark SQL Thrift Server and
provides an API for the PHP applications to execute the required queries.

From my little experience, opening a connection and closing it for each
query is not recommended (I am talking from my experience working with
CRUD applications that deal with some kind of database).

1- Does the same recommendation apply to working with the Spark SQL Thrift
Server?
2- If yes, is there any problem in keeping one connection open to the Spark
SQL Thrift Server for a long time?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-any-problem-in-having-a-long-opened-connection-to-spark-sql-thrift-server-tp21967.html



org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException

2015-03-17 Thread fanooos
I have a Hadoop cluster and I need to query the data stored on HDFS using
the Spark SQL Thrift Server.

The Spark SQL Thrift Server is up and running. It is configured to read from
a Hive table. The Hive table is an external table corresponding to a set of
files stored on HDFS. These files contain JSON data.

I am connecting to the Spark SQL Thrift Server using beeline. When I try to
execute a simple query like *select * from mytable limit 3* everything
works fine.


But when I try to execute other queries like *select count(*) from mytable*,
the following exception is thrown:

*org.apache.hadoop.hive.serde2.SerDeException:
org.codehaus.jackson.JsonParseException: Unrecognized character escape ' '
(code 32) at [Source: java.io.StringReader@34ef429a; line: 1, column: 351]*


What I understand from the exception is that there are some files containing
corrupted JSON.


Question 1: Am I understanding this correctly?
Question 2: How can I find the file(s) causing this problem if I have about
3 thousand files and each file contains about 700 lines of JSON data?
Question 3: If I am sure that the JSON in the files on HDFS is valid, what
should I do?
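On question 2, one hedged approach: since the SerDe fails on a single malformed line, the offending files can be found off-cluster by validating each file line by line with an ordinary JSON parser, for example after copying the files down with hdfs dfs -get. The directory used below is a throwaway demo stand-in, not a real path from this cluster:

```python
import json
import os
import tempfile

def find_bad_json_lines(directory):
    """Return (file name, line number) of the first malformed JSON line per file."""
    bad = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        with open(path, encoding="utf-8") as f:
            for lineno, line in enumerate(f, start=1):
                try:
                    json.loads(line)
                except ValueError:
                    bad.append((name, lineno))
                    break  # first bad line per file is enough here
    return bad

# Demo on a throwaway directory: one file with a corrupted line, one clean file.
d = tempfile.mkdtemp()
with open(os.path.join(d, "bad.json"), "w") as f:
    f.write('{"a": 1}\n{oops}\n')
with open(os.path.join(d, "good.json"), "w") as f:
    f.write('{"a": 1}\n{"b": 2}\n')
print(find_bad_json_lines(d))  # prints [('bad.json', 2)]
```

At roughly 3000 files x 700 lines this scan is cheap, and it answers question 1 directly: if it reports nothing, the data is valid JSON and the problem lies elsewhere (for example in the SerDe configuration).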







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/org-apache-hadoop-hive-serde2-SerDeException-org-codehaus-jackson-JsonParseException-tp22103.html



Spark sql thrift server slower than hive

2015-03-22 Thread fanooos
We have Cloudera CDH 5.3 installed on one machine.

We are trying to use the Spark SQL Thrift Server to execute some analysis
queries against a Hive table.

Without any changes to the configuration, we ran the following query on
both Hive and the Spark SQL Thrift Server:

*select * from tableName;*

The time taken by Spark is larger than the time taken by Hive, which is not
supposed to be like that.

The Hive table is mapped to JSON files stored in an HDFS directory, and we
are using *org.openx.data.jsonserde.JsonSerDe* for
serialization/deserialization.

Why does Spark take much more time to execute the query than Hive?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-sql-thrift-server-slower-than-hive-tp22177.html