Re: [E] Re: HLL Union and lgK config

2020-08-14 Thread leerho
I have placed a [DISCUSS] thread on our d...@datasketches.apache.org list if you wish to suggest some ideas! :) On Fri, Aug 14, 2020 at 4:06 PM leerho wrote: > The other option would be to deprecate the Hive SketchState update(...) > method and create a "newUpdate(...) method that has strings en

Re: [E] Re: HLL Union and lgK config

2020-08-14 Thread leerho
The other option would be to deprecate the Hive SketchState update(...) method and create a "newUpdate(...)" method that has strings encoded with UTF-8. And also document the reason why. Any other ideas? On Fri, Aug 14, 2020 at 4:03 PM leerho wrote: > Yep! It turns out that there is already an

Re: [E] Re: HLL Union and lgK config

2020-08-14 Thread leerho
Yep! It turns out that there is already an issue on this that was reported 18 days ago. Changing this will be fraught with problems as other Hive users may have a history of sketches created with Strings encoded as char[]. I'm not

Re: [E] Re: HLL Union and lgK config

2020-08-14 Thread Marko Mušnjak
Hi, It does seem that the first two days (probably from Spark+Hive UDFs), merged by themselves, closely match the exact count of 11034. The other 12 days (built using Kafka Streams), taken together, also closely match the exact count for the period. That would mean we have our cause here. Now to discove

Re: [E] Re: HLL Union and lgK config

2020-08-14 Thread leerho
Hi Marko, As I stated before, the first 2 sketches are the result of union operations, while the rest are not. I get the following: All 14 sketches : 34530 Without the first day : 27501; your count 24890; Error = 10.5% This is already way off. It represents an error of nearly 7 standard deviat
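
The 10.5% figure quoted above is the relative error of the sketch estimate against the exact count. A minimal sketch of that arithmetic, using only the two numbers from the message (the variable names are illustrative, not taken from the thread):

```python
# Numbers quoted in the message: sketch estimate without the first day,
# and the exact count reported for the same period.
estimate = 27501
exact = 24890

# Relative error of the estimate with respect to the exact count.
relative_error = (estimate - exact) / exact

print(f"{relative_error:.1%}")  # prints 10.5%
```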

Re: [E] Re: HLL Union and lgK config

2020-08-14 Thread Alexander Saydakov
Since you are mixing sketches built in different environments, have you ever tested that the input strings are hashed the same way? There is a chance that strings might be represented differently in Hive and Spark, and therefore the resulting sketches might be disjoint while you might believe that
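
To illustrate the concern above: the same string yields different bytes when it is encoded as UTF-8 than when it is treated as 16-bit characters (as a Java char[] is), so any hash of those bytes differs as well. The following is only a sketch under stated assumptions: SHA-256 stands in for the library's actual hash function, little-endian UTF-16 stands in for the char[] memory layout, and the item value is made up.

```python
import hashlib


def digest_of(data: bytes) -> str:
    """Return a short hex digest. SHA-256 is a stand-in, not the sketch library's hash."""
    return hashlib.sha256(data).hexdigest()[:16]


item = "user-42"  # hypothetical input item

as_utf8 = item.encode("utf-8")       # one byte per ASCII character
as_utf16 = item.encode("utf-16-le")  # two bytes per character, like 16-bit chars

same_bytes = as_utf8 == as_utf16
same_hash = digest_of(as_utf8) == digest_of(as_utf16)

print(len(as_utf8), len(as_utf16), same_bytes, same_hash)
```

If two environments feed a sketch through different encodings like this, identical items look like distinct items to the sketch, and a union of the two sets of sketches would overcount.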

Re: HLL Union and lgK config

2020-08-14 Thread Marko Mušnjak
Hi, The sketches are string-fed. Some of the sketches are built using Spark and the Hive functions from the datasketches library, while others are built using a Kafka Streams job. It's quite likely the covered period contains some sketches built by Spark and some by the streaming job, but I can't

Re: HLL Union and lgK config

2020-08-14 Thread leerho
Hi Marko, I notice that the first two sketches are the result of union operations, while the remaining sketches are pure streaming sketches. Could you perform Jon's request again, this time excluding the first two sketches? Just to cover the bases, could you explain the types of the data items that ar