Missing data in Kafka Consumer

2016-05-05 Thread Jerry
confirmed the same number in the Broker. But when I checked either HDFS or Cassandra, the number is just 363. The data is not always lost, just sometimes... That's weird and annoying to me. Can anybody give me some reasons? Thanks! Jerry

Re: Missing data in Kafka Consumer

2016-05-05 Thread Jerry
Hi David, Thank you for your response. Before inserting into Cassandra, I had checked that the data was already missing in HDFS (my second step is to load data from HDFS and then insert it into Cassandra). Can you send me the link relating to this bug in 0.8.2? Thank you! Jerry On Thu, May 5, 2016 at 12:38

Optimize the performance of inserting data to Cassandra with Kafka and Spark Streaming

2016-02-16 Thread Jerry
time) But only about 100 messages can be inserted into Cassandra in each round of testing. Can anybody give me advice on why the other messages (about 900 messages) can't be consumed? How do I configure and tune the parameters in order to improve the throughput of the consumers? Thank you very much

Re: Optimize the performance of inserting data to Cassandra with Kafka and Spark Streaming

2016-02-17 Thread Jerry
Rado, Yes, you are correct. A lot of messages are created at almost the same time (even using milliseconds). I changed to use "UUID.randomUUID()", with which all messages can be inserted into the Cassandra table without time lag. Thank you very much! Jerry Wong On Wed, Feb 17, 2016
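A minimal sketch of the idea described above: keying each record by a random UUID instead of its ingestion timestamp, so rows created in the same millisecond don't collide. The message values and table layout here are made up for illustration.

```scala
import java.util.UUID
import org.apache.spark.{SparkConf, SparkContext}

object UuidKeyExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("uuid-key").setMaster("local[*]"))

    // Stand-in for a batch of Kafka messages that all arrive within the same millisecond.
    val messages = sc.parallelize(Seq("m1", "m2", "m3"))

    // Keying each record by a random UUID instead of the ingestion timestamp
    // avoids primary-key collisions when many messages share a timestamp.
    val keyed = messages.map(m => (UUID.randomUUID().toString, m))

    keyed.collect().foreach(println)
    sc.stop()
  }
}
```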

Fast way to parse JSON in Spark

2016-02-23 Thread Jerry
. The json messages are coming from a Kafka consumer. It's over 1,500 messages per second, so the message processing (parsing and writing to Cassandra) also needs to be completed at the same rate (1,500/second). Thanks in advance. Jerry I appreciate it if you can give me any help and advice.

Is there any external dependencies for lag() and lead() when using data frames?

2015-08-10 Thread Jerry
pointed to with -cp when starting the spark shell, so all I do is "Test.run(sc)" in shell. Let me know what to look for to debug this problem. I'm not sure where to look to solve this problem. Thanks, Jerry

Re: Is there any external dependencies for lag() and lead() when using data frames?

2015-08-10 Thread Jerry
Thanks... looks like I now hit that bug about HiveMetaStoreClient as I now get the message about being unable to instantiate it. On a side note, does anyone know where hive-site.xml is typically located? Thanks, Jerry On Mon, Aug 10, 2015 at 2:03 PM, Michael Armbrust wrote: >

Re: Is there any external dependencies for lag() and lead() when using data frames?

2015-08-10 Thread Jerry
By the way, if Hive is present in the Spark install, does it show up in the text when you start the spark shell? Any commands I can run to check if it exists? I didn't set up the spark machine that I use, so I don't know what's present or absent. Thanks, Jerry On Mon, Aug 10,

Another issue with using lag and lead with data frames

2015-08-14 Thread Jerry
So it seems like dataframes aren't going to give me a break and just work. Now it evaluates, but goes nuts if it runs into a null case OR doesn't know how to get the correct data type when I specify the default value as a string expression. Let me know if anyone has a work around to this. PLEASE HELP M
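A minimal sketch, not necessarily what resolved this thread, of lag over a window with a typed default value (0.0) rather than a string expression. It assumes a HiveContext, which window functions required in Spark 1.4/1.5, and the column names are made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

object LagExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lag-example").setMaster("local[*]"))
    val sqlContext = new HiveContext(sc) // window functions need HiveContext in 1.4/1.5
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(("a", 1, 10.0), ("a", 2, 20.0), ("b", 1, 5.0)))
      .toDF("key", "ts", "value")

    // A typed default (0.0) for lag sidesteps both the null rows at partition
    // boundaries and the data-type trouble with string-expression defaults.
    val w = Window.partitionBy($"key").orderBy($"ts")
    df.withColumn("prev_value", lag($"value", 1, 0.0).over(w)).show()
  }
}
```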

Re: Another issue with using lag and lead with data frames

2015-08-14 Thread Jerry
ther. Hopefully those links point me to something useful. Let me know if you can run the above code/ what you did different to get that code to run. Thanks, Jerry On Fri, Aug 14, 2015 at 1:23 PM, Salih Oztop wrote: > Hi Jerry, > This blog post is perfect for window function

Re: Another issue with using lag and lead with data frames

2015-08-14 Thread Jerry
12) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) On Fri, Aug 14, 2015 at 1:39 PM, Jerry wrote: > Hi Salih, > Normally I do sort before performing that operation, but since I've been > trying to get this working for a week, I'm just loading something simple to > test if la

What am I missing that's preventing javac from finding the libraries (CLASSPATH is setup...)?

2015-08-18 Thread Jerry
ot exist import org.apache.spark.sql.hive.*; Let me know what I'm doing wrong. Thanks, Jerry

Re: What am I missing that's preventing javac from finding the libraries (CLASSPATH is setup...)?

2015-08-18 Thread Jerry
So from what I understand, those usually pull dependencies for a given project? I'm able to run the spark shell so I'd assume I have everything. What am I missing from the big picture and what directory do I run maven on? Thanks, Jerry On Tue, Aug 18, 2015 at 11:15 AM, Ted

Does the driver program always run local to where you submit the job from?

2015-08-26 Thread Jerry
a cluster. The file I'm opening is purely for the driver program and not something the worker nodes are going to read from. Thanks, Jerry

Re: Does the driver program always run local to where you submit the job from?

2015-08-26 Thread Jerry
Thanks! On Wed, Aug 26, 2015 at 2:06 PM, Marcelo Vanzin wrote: > On Wed, Aug 26, 2015 at 2:03 PM, Jerry wrote: > > Assuming you're submitting the job from a terminal; when main() is called, > if I > > try to open a file locally, can I assume the machine is always the one I >

Re: Starting with spark

2014-07-24 Thread Jerry
spark. Best Regards, Jerry Sent from my iPad > On Jul 24, 2014, at 6:53 AM, Sameer Sayyed wrote: > > Hello All, > > I am new user of spark, I am using cloudera-quickstart-vm-5.0.0-0-vmware for > execute sample examples of Spark. > I am very sorry for silly and basic qu

Re: Testing ETL with Spark using Pytest

2021-02-09 Thread Jerry Vinokurov
t there is no limitation on how fixtures can be used within functions or within other fixtures (which are also just functions), since the result of the fixture call is just some Python object. Hope this helps, Jerry On Tue, Feb 9, 2021 at 10:18 AM Mich Talebzadeh wrote: > I was a bit confu

Re: Testing ETL with Spark using Pytest

2021-02-09 Thread Jerry Vinokurov
antly across a test suite. In general I try to keep my fixtures to one concrete task only, so that if I find myself repeating a pattern I just factor it out into another fixture. On Tue, Feb 9, 2021 at 11:14 AM Mich Talebzadeh wrote: > Thanks Jerry for your comments. > > The easiest optio

Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-14 Thread Jerry Peng
Craig, Thanks! Please let us know the result! Best, Jerry On Thu, Sep 14, 2023 at 12:22 PM Mich Talebzadeh wrote: > > Hi Craig, > > Can you please clarify what this bug is and provide sample code causing > this issue? > > HTH > > Mich Talebzadeh, > Disting

Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-18 Thread Jerry Peng
written to disk you may see duplicates when there are failures. However, if you read the output location with Spark you should get exactly once results (unless there is a bug) since spark will know how to use the commit log to see what data files are committed and not. Best, Jerry On Mon, Se
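A small sketch of the read-back path described above, with a placeholder output path. When batch Spark reads a directory written by the streaming file sink, it consults the sink's commit log under _spark_metadata rather than listing raw files, which is why uncommitted duplicates are filtered out.

```scala
import org.apache.spark.sql.SparkSession

object ReadSinkOutput {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-sink-output")
      .master("local[*]")
      .getOrCreate()

    // Reading the streaming file sink's output directory with batch Spark:
    // the _spark_metadata commit log determines which files count as committed,
    // so partially written files from failed batches are ignored.
    val committed = spark.read.parquet("/path/to/stream-output") // placeholder path
    committed.show()
  }
}
```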

Re: Benchmark results between Flink and Spark

2015-07-05 Thread Jerry Lam
in comparisons to Flink is one of the immediate questions I have. It would be great if they have the benchmark software available somewhere for other people to experiment. just my 2 cents, Jerry On Sun, Jul 5, 2015 at 4:35 PM, Ted Yu wrote: > There was no mentioning of the versions of Flink

Re: Benchmark results between Flink and Spark

2015-07-14 Thread Jerry Lam
-Bits-and-Bytes.html > > Probably if re-ran the benchmarks with 1.5/tungsten line would close the > gap a bit(or a lot) with spark moving towards similar style off-heap memory > mgmt, more planning optimizations > > > *From:* Jerry Lam [mailto:chiling...@gmail.com ] > *Sent:* Sun

Re: Counting distinct values for a key?

2015-07-19 Thread Jerry Lam
You mean this does not work? SELECT key, count(value) from table group by key On Sun, Jul 19, 2015 at 2:28 PM, N B wrote: > Hello, > > How do I go about performing the equivalent of the following SQL clause in > Spark Streaming? I will be using this on a Windowed DStream. > > SELECT key, coun

[Spark SQL] Unexpected Behaviour

2016-03-28 Thread Jerry Lam
er").drop(d2("label")).select(d1("label")) The above code will throw an exception saying the column label is not found. Do you have a reason for throwing an exception when the column has not been dropped for d1("label")? Best Regards, Jerry

Re: [Spark SQL] Unexpected Behaviour

2016-03-28 Thread Jerry Lam
e are other bugs I found these three days that are associated with this type of joins. In one case, if I don't drop the duplicate column BEFORE the join, spark has preferences on the columns from d2 dataframe. I will see if I can replicate in a small program like above. Best Regards, Jerry O

Re: [Spark SQL] Unexpected Behaviour

2016-03-28 Thread Jerry Lam
Scan PhysicalRDD[_1#12,_2#13] Again, this is just the tip of the iceberg. I have spent hours finding out this weird behaviour. Best Regards, Jerry On Tue, Mar 29, 2016 at 2:01 AM, Jerry Lam wrote: > Hi Sunitha, > > Thank you for the reference Jira. It looks l

Re: [Spark SQL] Unexpected Behaviour

2016-03-29 Thread Jerry Lam
Hi Divya, This is not a self-join. d1 and d2 contain totally different rows. They are derived from the same table. The transformation that are applied to generate d1 and d2 should be able to disambiguate the labels in the question. Best Regards, Jerry On Tue, Mar 29, 2016 at 2:43 AM, Divya

Re: [Spark SQL] Unexpected Behaviour

2016-03-29 Thread Jerry Lam
rong answer though. As I said, this is just the tip of the iceberg. I have experienced worse than this. For example, you might think renaming fields will work, but in some cases it still returns wrong results. Best Regards, Jerry On Tue, Mar 29, 2016 at 7:38 AM, Jerry Lam wrote: > Hi Divya, > &g

[Spark SQL]: UDF with Array[Double] as input

2016-04-01 Thread Jerry Lam
Hi spark users and developers, Has anyone tried to pass an Array[Double] as an input to a UDF? I tried for many hours reading the spark sql code but I still couldn't figure out a way to do this. Best Regards, Jerry
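A hedged sketch of one way this is commonly handled: Spark SQL passes array columns to Scala UDFs as Seq[Double] (a WrappedArray) rather than Array[Double], so declaring the UDF parameter as Seq[Double] is usually the way in. The column names below are made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf

object ArrayUdfExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("array-udf").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq((1, Seq(1.0, 2.0, 3.0)), (2, Seq(4.0, 5.0)))).toDF("id", "values")

    // Declare the UDF input as Seq[Double]; array columns arrive as WrappedArray.
    val sumArray = udf((xs: Seq[Double]) => xs.sum)

    df.withColumn("total", sumArray($"values")).show()
  }
}
```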

Spark SQL Nested Array of JSON with empty field

2016-06-03 Thread Jerry Wong
like, val row = sqlContext.sql("SELECT firstname, middlename, lastname, address.state, address.city FROM jsontable") This fails with an error on line 1: no "middlename". How do I handle this case in Spark SQL? Many thanks in advance! Jerry

Spark SQL: org.apache.spark.sql.AnalysisException: cannot resolve "some columns" given input columns.

2016-06-07 Thread Jerry Wong
firstName as first_name, middleName as middle_name, lastName as last_name from jsonTable) But there is an error: org.apache.spark.sql.AnalysisException: cannot resolve 'middleName' given input columns firstName, lastName; Can anybody give me your wisdom or any suggestions? Thanks! Jerry
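One common workaround, sketched here and not necessarily what was settled on in this thread: supply an explicit schema when reading the JSON, so that fields absent from the data (such as middleName) still exist as nullable columns. The file path and field list are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object JsonSchemaExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("json-schema").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Declaring middleName up front means it resolves (as null) even when
    // no record in the file actually contains that field.
    val schema = StructType(Seq(
      StructField("firstName", StringType, nullable = true),
      StructField("middleName", StringType, nullable = true),
      StructField("lastName", StringType, nullable = true)))

    val people = sqlContext.read.schema(schema).json("people.json") // placeholder path
    people.registerTempTable("jsonTable")
    sqlContext.sql(
      "select firstName as first_name, middleName as middle_name, lastName as last_name from jsonTable"
    ).show()
  }
}
```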

Re: ideal number of executors per machine

2015-12-15 Thread Jerry Lam
Best Regards, Jerry > On Dec 15, 2015, at 5:18 PM, Jakob Odersky wrote: > > Hi Veljko, > I would assume keeping the number of executors per machine to a minimum is > best for performance (as long as you consider memory requirements as well). > Each executor is a proces

Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread Jerry Lam
Hi Kostiantyn, Can you define those properties in hdfs-site.xml and make sure it is visible in the class path when you spark-submit? It looks like a conf sourcing issue to me. Cheers, Sent from my iPhone > On 30 Dec, 2015, at 1:59 pm, KOSTIANTYN Kudriavtsev > wrote: > > Chris, > > thanks

Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread Jerry Lam
Hi Kostiantyn, I want to confirm that it works first by using hdfs-site.xml. If yes, you could define different spark-{user-x}.conf and source them during spark-submit. let us know if hdfs-site.xml works first. It should. Best Regards, Jerry Sent from my iPhone > On 30 Dec, 2015, at 2:31

Re: SparkSQL integration issue with AWS S3a

2016-01-01 Thread Jerry Lam
ey is right. I'm using my phone so I cannot easily verify. Then you can specify a different user using a different spark.conf via --properties-file when you spark-submit HTH, Jerry Sent from my iPhone > On 31 Dec, 2015, at 2:06 pm, KOSTIANTYN Kudriavtsev > wrote: > > Hi Jerry, &

Re: SparkSQL integration issue with AWS S3a

2016-01-06 Thread Jerry Lam
Hi Kostiantyn, Yes. If security is a concern then this approach cannot satisfy it. The keys are visible in the properties files. If the goal is to hide them, you might be able to go a bit further with this approach. Have you looked at the spark security page? Best Regards, Jerry Sent from my iPhone

[Spark SQL]: Issues with writing dataframe with Append Mode to Parquet

2016-01-12 Thread Jerry Lam
does generate _temporary files and then it moves the files under _temporary to the output directory. Is this behavior expected? Or is it a bug? I'm using Spark 1.5.2. Best Regards, Jerry

Re: [Spark SQL]: Issues with writing dataframe with Append Mode to Parquet

2016-01-12 Thread Jerry Lam
Hi Michael, Thanks for the hint! So if I turn off speculation, consecutive appends like above will not produce temporary files right? Which class is responsible for disabling the use of DirectOutputCommitter? Thank you, Jerry On Tue, Jan 12, 2016 at 4:12 PM, Michael Armbrust wrote: > Th
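A sketch of the setting mentioned above, nothing more: disabling speculative execution via SparkConf before an append-mode parquet write. Whether this alone avoids the _temporary step for consecutive appends was the open question in the thread; the output path is a placeholder.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object AppendParquetExample {
  def main(args: Array[String]): Unit = {
    // Speculative execution is usually disabled alongside direct output
    // committers, since speculative tasks can double-write output files.
    val conf = new SparkConf()
      .setAppName("append-parquet")
      .setMaster("local[*]")
      .set("spark.speculation", "false")

    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "value")
    df.write.mode("append").parquet("/tmp/append-example") // placeholder output path
  }
}
```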

Re: DataFrameWriter on partitionBy for parquet eat all RAM

2016-01-14 Thread Jerry Lam
ot of memory and a very long time to read the table back. Best Regards, Jerry On Thu, Jan 14, 2016 at 2:31 PM, Arkadiusz Bicz wrote: > Hi > > What is the proper configuration for saving a parquet partition with > a large number of repeated keys? > > In the below code I load 500 million row

Re: DataFrameWriter on partitionBy for parquet eat all RAM

2016-01-15 Thread Jerry Lam
partitioned table because it takes very long (over hours on s3) to execute the sqlcontext.read.parquet("partitioned_table"). Best Regards, Jerry Sent from my iPhone > On 15 Jan, 2016, at 3:59 pm, Michael Armbrust wrote: > > See here for some workarounds: > https://

Re: How To Save TF-IDF Model In PySpark

2016-01-15 Thread Jerry Lam
Can you save it to parquet with the vector in one field? Sent from my iPhone > On 15 Jan, 2016, at 7:33 pm, Andy Davidson > wrote: > > Are you using 1.6.0 or an older version? > > I think I remember something in 1.5.1 saying save was not implemented in > python. > > > The current doc does

[Spark-SQL] from_unixtime with user-specified timezone

2016-01-18 Thread Jerry Lam
Hi spark users and developers, what do you do if you want the from_unixtime function in spark sql to return the timezone you want instead of the system timezone? Best Regards, Jerry

Re: [Spark-SQL] from_unixtime with user-specified timezone

2016-01-18 Thread Jerry Lam
nk I will make a UDF if this is the only way out of the box. Thanks! Jerry On Mon, Jan 18, 2016 at 2:32 PM, Alexander Pivovarov wrote: > Look at > to_utc_timestamp > > from_utc_timestamp > On Jan 18, 2016 9:39 AM, "Jerry Lam" wrote: > >> Hi spark user
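A sketch of the UDF route mentioned above, formatting an epoch-seconds column in a caller-supplied timezone with java.text.SimpleDateFormat. The column names and output pattern are assumptions.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{lit, udf}

object FromUnixtimeTzExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("from-unixtime-tz").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Format epoch seconds in an explicit timezone instead of the JVM default.
    val fromUnixtimeTz = udf((epochSeconds: Long, tz: String) => {
      val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
      fmt.setTimeZone(TimeZone.getTimeZone(tz))
      fmt.format(new Date(epochSeconds * 1000L))
    })

    val df = sc.parallelize(Seq((1, 1453150000L), (2, 1453160000L))).toDF("id", "ts")
    df.withColumn("ts_tokyo", fromUnixtimeTz($"ts", lit("Asia/Tokyo"))).show()
  }
}
```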

Re: sqlContext.cacheTable("tableName") vs dataFrame.cache()

2016-01-19 Thread Jerry Lam
Is cacheTable similar to asTempTable before? Sent from my iPhone > On 19 Jan, 2016, at 4:18 am, George Sigletos wrote: > > Thanks Kevin for your reply. > > I was suspecting the same thing as well, although it still does not make much > sense to me why would you need to do both: > myData.cach

Re: Spark, Mesos, Docker and S3

2016-01-26 Thread Jerry Lam
Hi Mao, Can you try --jars to include those jars? Best Regards, Jerry Sent from my iPhone > On 26 Jan, 2016, at 7:02 pm, Mao Geng wrote: > > Hi there, > > I am trying to run Spark on Mesos using a Docker image as executor, as > mentioned > http://spark.apache.org/d

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Jerry Lam
I think spark dataframe supports more than just SQL. It is more like a pandas dataframe. (I rarely use the SQL feature.) There are a lot of novelties in dataframe so I think it is quite optimized for many tasks. The in-memory data structure is very memory efficient. I just changed a very slow RDD prog

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Jerry Lam
Hi Michael, Is there a section in the spark documentation demonstrate how to serialize arbitrary objects in Dataframe? The last time I did was using some User Defined Type (copy from VectorUDT). Best Regards, Jerry On Tue, Feb 2, 2016 at 8:46 PM, Michael Armbrust wrote: > A princi

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Jerry Lam
ou say why you think it is the case? Relative to what other ways? Best Regards, Jerry On Tue, Feb 2, 2016 at 8:59 PM, Nirav Patel wrote: > I dont understand why one thinks RDD of case object doesn't have > types(schema) ? If spark can convert RDD to DataFrame which means it > un

Union of RDDs without the overhead of Union

2016-02-02 Thread Jerry Lam
same output text file? Thank you! Jerry

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-03 Thread Jerry Lam
t. My understanding is that Dataset is the new RDD. Best Regards, Jerry Sent from my iPhone > On 3 Feb, 2016, at 12:26 am, Koert Kuipers wrote: > > with respect to joins, unfortunately not all implementations are available. > for example i would like to use joins where one side is s

Re: Spark 1.5.2 memory error

2016-02-03 Thread Jerry Lam
what I fine tune most. Making sure the task/core has enough memory to execute to completion. Sometimes you really don't know how much data you keep in memory until you profile your application (calculating some statistics helps). Best Regards, Jerry On Wed, Feb 3, 2016 at 4:58 PM, Nirav P

Re: Convert Iterable to RDD

2016-02-12 Thread Jerry Lam
Not sure if I understand your problem well but why don't you create the file locally and then upload to hdfs? Sent from my iPhone > On 12 Feb, 2016, at 9:09 am, "seb.arzt" wrote: > > I have an Iterator of several million elements, which unfortunately won't fit > into the driver memory at the s

Re: Streaming with broadcast joins

2016-02-19 Thread Jerry Lam
something I've misunderstood. Best Regards, Jerry Sent from my iPhone > On 19 Feb, 2016, at 10:20 am, Sebastian Piu wrote: > > I don't have the code with me now, and I ended moving everything to RDD in > the end and using map operations to do some lookups, i.e. instead

Pattern Matching over a Sequence of rows using Spark

2016-02-28 Thread Jerry Lam
that do that, please share your findings! Thank you, Jerry

Spark SQL: Native Support for LATERAL VIEW EXPLODE

2015-09-25 Thread Jerry Lam
union'' expected but identifier view found with the query look like: "select items from purhcases lateral view explode(purchase_items) tbl as items" Best Regards, Jerry

Re: Spark SQL: Native Support for LATERAL VIEW EXPLODE

2015-09-26 Thread Jerry Lam
dataframe? Best Regards, Jerry On Fri, Sep 25, 2015 at 7:53 PM, Michael Armbrust wrote: > The SQL parser without HiveContext is really simple, which is why I > generally recommend users use HiveContext. However, you can do it with > dataframes: > > import org.apache.spar
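A minimal sketch of the DataFrame route hinted at in that reply, using explode on a hypothetical purchases table with an array column (the data and column names are made up).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.explode

object ExplodeExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("explode-example").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Hypothetical purchases data: one row per purchase, with an array of items.
    val purchases = sc.parallelize(Seq(
      ("p1", Seq("apple", "bread")),
      ("p2", Seq("milk")))).toDF("purchase_id", "purchase_items")

    // DataFrame equivalent of LATERAL VIEW EXPLODE: one output row per item.
    purchases.select($"purchase_id", explode($"purchase_items").as("items")).show()
  }
}
```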

Spark SQL: Implementing Custom Data Source

2015-09-28 Thread Jerry Lam
Hi spark users and developers, I'm trying to learn how to implement a custom data source for Spark SQL. Is there documentation that I can use as a reference? I'm not sure exactly what needs to be extended/implemented. A general workflow would be greatly helpful! Best Regards, Jerry
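For reference, a minimal sketch of the Spark 1.x external data source API, which is the usual starting point: a DefaultSource that extends RelationProvider, and a BaseRelation with TableScan that supplies a schema and an RDD[Row]. The relation below just emits two hard-coded rows.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Entry point looked up when the package name is passed to sqlContext.read.format(...)
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new DummyRelation(sqlContext)
}

// A relation describes the schema and how to produce rows for a full scan.
class DummyRelation(val sqlContext: SQLContext) extends BaseRelation with TableScan {
  override def schema: StructType =
    StructType(Seq(StructField("value", StringType, nullable = false)))

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row("hello"), Row("world")))
}
```

It would then be loaded with sqlContext.read.format("the.package.containing.DefaultSource").load(); pruning and filter pushdown come from the additional PrunedScan/PrunedFilteredScan traits.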

Re: Spark SQL: Implementing Custom Data Source

2015-09-29 Thread Jerry Lam
Hi Michael and Ted, Thank you for the reference. Is it true that once I implement a custom data source, it can be used in all Spark-supported languages? That is, Scala, Java, Python and R. :) I want to take advantage of the interoperability that is already built into spark. Thanks! Jerry On Tue

spark-submit --packages using different resolver

2015-10-01 Thread Jerry Lam
ivate repo? Thank you! Jerry

Re: Limiting number of cores per job in multi-threaded driver.

2015-10-04 Thread Jerry Lam
Philip, the guy is trying to help you. Calling him silly is a bit too far. He might assume your problem is IO bound, which might not be the case. If you need only 4 cores per job no matter what, there is little advantage to using spark in my opinion, because you can easily do this with just a worker

Re: spark-submit --packages using different resolver

2015-10-06 Thread Jerry Lam
support s3 repo. I will file a jira ticket for this. Best Regards, Jerry On Sat, Oct 3, 2015 at 12:50 PM, Burak Yavuz wrote: > Hi Jerry, > > The --packages feature doesn't support private repositories right now. > However, in the case of s3, maybe it might work. Could you pleas

Re: spark-submit --packages using different resolver

2015-10-06 Thread Jerry Lam
This is the ticket SPARK-10951 <https://issues.apache.org/jira/browse/SPARK-10951> Cheers~ On Tue, Oct 6, 2015 at 11:33 AM, Jerry Lam wrote: > Hi Burak, > > Thank you for the tip. > Unfortunately it does not work. It throws: > > java.net.MalformedURLException: unk

Re: Dataframes - sole data structure for parallel computations?

2015-10-08 Thread Jerry Lam
I just read the article by ogirardot but I don't agree. It is like saying pandas dataframe is the sole data structure for analyzing data in python. Can Pandas dataframe replace Numpy array? The answer is simply no from an efficiency perspective for some computations. Unless there is a computer s

Re: Indexing Support

2015-10-18 Thread Jerry Lam
I'm interested in it but I doubt there is r-tree indexing support in the near future, as spark is not a database. You might have better luck looking at databases with spatial indexing support out of the box. Cheers Sent from my iPad On 2015-10-18, at 17:16, Mustafa Elbehery wrote: > Hi All

Re: Spark executor on Mesos - how to set effective user id?

2015-10-19 Thread Jerry Lam
Can you try setting SPARK_USER at the driver? It is used to impersonate users at the executor. So if you have a user set up for launching spark jobs on the executor machines, simply set it to that user name for SPARK_USER. There is another configuration that prevents jobs from being launched with a

Spark SQL: Preserving Dataframe Schema

2015-10-20 Thread Jerry Lam
,LongType,true), StructField(type,StringType,true)) As you can see the schema does not match. The nullable field is set to true for timestamp upon reading the dataframe back. Is there a way to preserve the schema so that what we write to will be what we read back? Best Regards, Jerry

Spark SQL: Issues with using DirectParquetOutputCommitter with APPEND mode and OVERWRITE mode

2015-10-22 Thread Jerry Lam
y a create? Best Regards, Jerry

[Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-24 Thread Jerry Lam
frame) - org.apache.spark.sql.DataFrameWriter.parquet(java.lang.String) @bci=8, line=304 (Interpreted frame) Best Regards, Jerry

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
ate over 1 million files. Not sure why it OOM the driver after the job is marked _SUCCESS in the output folder. Best Regards, Jerry On Sat, Oct 24, 2015 at 9:35 PM, Jerry Lam wrote: > Hi Spark users and developers, > > Does anyone encounter any issue when a spark SQL job produces a lot

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
parameters to make it more memory efficient? Best Regards, Jerry On Sun, Oct 25, 2015 at 8:39 PM, Jerry Lam wrote: > Hi guys, > > After waiting for a day, it actually causes OOM on the spark driver. I > configure the driver to have 6GB. Note that I didn't call refresh myself. > T

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
mory which is a bit odd in my opinion. Any help will be greatly appreciated. Best Regards, Jerry On Sun, Oct 25, 2015 at 9:25 PM, Josh Rosen wrote: > Hi Jerry, > > Do you have speculation enabled? A write which produces one million files > / output partitions might be using to

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
) org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:31) org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:395) org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:267) On Sun, Oct 25, 2015 at 10:25 PM, Jerry Lam wrote: > Hi Josh, > > No I don't have speculation enabl

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-26 Thread Jerry Lam
partition is over 100. Best Regards, Jerry Sent from my iPhone > On 26 Oct, 2015, at 2:50 am, Fengdong Yu wrote: > > How many partitions you generated? > if Millions generated, then there is a huge memory consumed. > > > > > >> On Oct 26, 2015, at 10:5

[Spark-SQL]: Unable to propagate hadoop configuration after SparkContext is initialized

2015-10-27 Thread Jerry Lam
concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Any idea why it can read the schema from the parquet file but not processing the file? It feels like the hadoop configuration is not sent to the executor for some reasons... Thanks, Jerry

Re: [Spark-SQL]: Unable to propagate hadoop configuration after SparkContext is initialized

2015-10-27 Thread Jerry Lam
load the parquet file but I cannot perform a count on the parquet file because of the AmazonClientException. It means that the credential is used during the loading of the parquet but not when we are processing the parquet file. How can this happen? Best Regards, Jerry On Tue, Oct 27, 2015 a

Re: [Spark-SQL]: Unable to propagate hadoop configuration after SparkContext is initialized

2015-10-27 Thread Jerry Lam
t;key", "value") does not propagate through all SQL jobs within the same SparkContext? I haven't tried with Spark Core so I cannot tell. Is there a workaround, given that it seems to be broken? I need to do this programmatically after the SparkContext is instantiated, not before... Best Regard

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Jerry Lam
Hi Bryan, Did you read the email I sent few days ago. There are more issues with partitionBy down the road: https://www.mail-archive.com/user@spark.apache.org/msg39512.html <https://www.mail-archive.com/user@spark.apache.org/msg39512.html> Best Regards, Jerry > On Oct 28, 2015, a

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Jerry Lam
takes a while to initialize the partition table and it requires a lot of memory from the driver. I would not use it if the number of partitions goes over a few hundred. Hope this helps, Jerry Sent from my iPhone > On 28 Oct, 2015, at 6:33 pm, Bryan wrote: > > Jerry, > > Thank you f

Exception in thread "main" java.lang.IllegalArgumentException: Positive number of slices required

2015-10-29 Thread Jerry Wong
I used spark 1.3.1 to populate the event logs to Cassandra. But there is an exception whose cause I could not figure out. Can anybody give me any help? Exception in thread "main" java.lang.IllegalArgumentException: Positive number of slices required at org.apache.spark.rdd.ParallelCollect

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
f this actually works or not. :) Best Regards, Jerry On Sun, Nov 1, 2015 at 3:03 PM, Koert Kuipers wrote: > hello all, > i am trying to get familiar with spark sql partitioning support. > > my data is partitioned by date, so like this: > data/date=2015-01-01 > data/date=

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
Hi Koert, You should be able to see whether it requires scanning the whole data by running "explain" on the query. The physical plan should say something about it. I wonder if you are trying the distinct-sort-by-limit approach or the max-date approach? Best Regards, Jerry On Sun, Nov 1, 2015
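A sketch of what that check looks like, under the assumption of a date-partitioned parquet layout like the one in the question (data/date=2015-01-01, ...) and the max-date approach:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.max

object MaxDateExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("max-date").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Partitioned layout assumed: data/date=2015-01-01/..., data/date=2015-01-02/...
    val df = sqlContext.read.parquet("data")

    val maxDate = df.select(max($"date"))
    maxDate.explain(true) // inspect the physical plan for partition pruning
    maxDate.show()
  }
}
```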

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
to be faster though. > > On Sun, Nov 1, 2015 at 4:36 PM, Jerry Lam wrote: > >> Hi Koert, >> >> You should be able to see if it requires scanning the whole data by >> "explain" the query. The physical plan should say something about it. I >> wonder

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
physical plan, you can navigate the actual execution in the web UI to see how much data is actually read to satisfy this request. I hope it only requires a few bytes for few dates. Best Regards, Jerry On Sun, Nov 1, 2015 at 5:56 PM, Jerry Lam wrote: > I agreed the max date will satisfy

Re: Please reply if you use Mesos fine grained mode

2015-11-03 Thread Jerry Lam
We "used" Spark on Mesos to build an interactive data analysis platform because the interactive session could be long and might not use Spark for the entire session. It would be very wasteful of resources to use the coarse-grained mode because it keeps resources for the entire session. Therefore, fine-gr

Re: Spark EC2 script on Large clusters

2015-11-05 Thread Jerry Lam
Does Qubole use Yarn or Mesos for resource management? Sent from my iPhone > On 5 Nov, 2015, at 9:02 pm, Sabarish Sasidharan > wrote: > > Qubole

[Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Jerry Lam
$1.apply$mcZ$sp(SparkILoopExt.scala:127) at org.apache.spark.repl.SparkILoopExt$$anonfun$process$1.apply(SparkILoopExt.scala:113) at org.apache.spark.repl.SparkILoopExt$$anonfun$process$1.apply(SparkILoopExt.scala:113) Best Regards, Jerry

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Jerry Lam
own Source) at org.apache.derby.jdbc.Driver20.connect(Unknown Source) at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown Source) at java.sql.DriverManager.getConnection(DriverManager.java:571) Best Regards, Jerry > On Nov 6, 2015, at 12:12 PM, Ted Yu w

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Jerry Lam
e above call. > > FYI > > On Fri, Nov 6, 2015 at 8:53 AM, Jerry Lam <mailto:chiling...@gmail.com>> wrote: > Hi spark users and developers, > > Is it possible to disable HiveContext from being instantiated when using > spark-shell? I got the following errors when

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Jerry Lam
. /home/jerry directory). It will give me an exception like below. Since I don’t use HiveContext, I don’t see the need to maintain a database. What is interesting is that pyspark shell is able to start more than 1 session at the same time. I wonder what pyspark has done better than spark-shell

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Jerry Lam
spark.sql.hive.enabled false configuration would be lovely too. :) An additional bonus is that, from a rough observation, it requires less memory (~100-200MB) on the driver side if we don't use HiveContext. Thanks and have a nice weekend! Jerry > On Nov 6, 2015, at 5:53 PM, Ted Yu wrote: > >

Re: Very slow startup for jobs containing millions of tasks

2015-11-14 Thread Jerry Lam
Hi Ted, That looks exactly what happens. It has been 5 hrs now. The code was built for 1.4. Thank you very much! Best Regards, Jerry Sent from my iPhone > On 14 Nov, 2015, at 11:21 pm, Ted Yu wrote: > > Which release are you using ? > If older than 1.5.0, you miss some f

Re: spark-ec2 vs. EMR

2015-12-01 Thread Jerry Lam
Simply put: EMR = Hadoop Ecosystem (Yarn, HDFS, etc) + Spark + EMRFS + Amazon EMR API + Selected Instance Types + Amazon EC2 Friendly (bootstrapping) spark-ec2 = HDFS + Yarn (Optional) + Spark (Standalone Default) + Any Instance Type I use spark-ec2 for prototyping and I have never used it for p

Re: spark-ec2 vs. EMR

2015-12-02 Thread Jerry Lam
w you to do that without manual process. Best Regards, Jerry On Wed, Dec 2, 2015 at 1:02 PM, Alexander Pivovarov wrote: > Do you think it's a security issue if EMR started in VPC with a subnet > having Auto-assign Public IP: Yes > > you can remove all Inbound rules having 0.0.

Re: df.partitionBy().parquet() java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-12-02 Thread Jerry Lam
-ff...@mail.gmail.com%3E> Sorry, I don’t have a solution. Best Regards, Jerry > On Dec 2, 2015, at 5:11 PM, Don Drake wrote: > > Does anyone have any suggestions on creating a large amount of parquet files? > Especially in regards to the last phase where it creates the _met

Re: Counting distinct values for a key?

2015-07-19 Thread Jerry Lam
Hi Nikunj, Sorry, I totally misread your question. I think you need to first groupByKey (get all values of the same key together), then follow with mapValues (probably put the values into a set and then take its size, because you want a distinct count) HTH, Jerry Sent from my iPhone >
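A minimal sketch of that approach on a plain RDD of key/value pairs; the sample data is made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DistinctPerKeyExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("distinct-per-key").setMaster("local[*]"))

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 1), ("a", 2), ("b", 3)))

    // Group all values of a key, dedupe them in a set, and take its size.
    val distinctCounts = pairs.groupByKey().mapValues(vs => vs.toSet.size)

    distinctCounts.collect().foreach(println) // (a,2), (b,1)
    sc.stop()
  }
}
```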

Re: Spark Mesos Dispatcher

2015-07-19 Thread Jerry Lam
Yes. Sent from my iPhone > On 19 Jul, 2015, at 10:52 pm, "Jahagirdar, Madhu" > wrote: > > All, > > Can we run different version of Spark using the same Mesos Dispatcher. For > example we can run drivers with Spark 1.3 and Spark 1.4 at the same time ? > > Regards, > Madhu Jahagirdar > > Th

Re: Spark Mesos Dispatcher

2015-07-19 Thread Jerry Lam
sing that dispatcher ? > ------ > *From:* Jerry Lam [chiling...@gmail.com] > *Sent:* Monday, July 20, 2015 8:27 AM > *To:* Jahagirdar, Madhu > *Cc:* user; d...@spark.apache.org > *Subject:* Re: Spark Mesos Dispatcher > > Yes. > > Sent from my iPhone > > On 19 Jul,

Re: Parquet problems

2015-07-22 Thread Jerry Lam
Hi guys, I noticed that too. Anders, can you confirm that it works on the Spark 1.5 snapshot? This is what I tried at the end. It seems it is a 1.4 issue. Best Regards, Jerry On Wed, Jul 22, 2015 at 11:46 AM, Anders Arpteg wrote: > No, never really resolved the problem, except by increasing

Re: Partition parquet data by ENUM column

2015-07-23 Thread Jerry Lam
lared type (org.apache.parquet.io.api.Binary) does not match the schema found in file metadata. Column item is of type: FullTypeDescriptor(PrimitiveType: BINARY, OriginalType: ENUM) Valid types for this column are: null Is it because Spark does not recognize ENUM type in parquet? Best Regards, Jerry On Wed, Jul 22, 201

Unexpected performance issues with Spark SQL using Parquet

2015-07-27 Thread Jerry Lam
pping. The speed is 4x faster in the data-without-mapping, which means that the more columns a parquet file has, the slower it is, even when only a specific column is needed. Anyone have an explanation for this? I was expecting both of them to finish at approximately the same time. Best Regards, Jerry
