A very simple example is:

    sql("select create_map(1, 'a', 2, 'b')").union(sql("select create_map(2, 'b', 1, 'a')")).distinct

By definition a map should not care about the order of its entries, so the above query should return one record. However, it returns 2 records before SPARK-19893.
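For anyone who needs an order-insensitive distinct on a map column before equality on map type is properly supported, here is a rough, untested sketch of one possible workaround (Scala API; the "canonicalize" name is just for illustration, and map_demo2 is the table from the example below): convert the map into a list of entries sorted by key, since array and struct columns, unlike map columns, can be compared.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    // Sort the entries by key so that two maps with the same contents always
    // produce the same array<struct> value, regardless of entry order.
    // (Null maps would need extra handling; omitted here for brevity.)
    val canonicalize = udf((m: Map[String, Int]) => m.toSeq.sortBy(_._1))

    spark.table("map_demo2")
      .select(canonicalize($"metrics").as("metrics_sorted"))
      .distinct()
      .show(false)

This sidesteps the map comparison entirely; whether it is acceptable depends on whether you need the result back as a map afterwards.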
On Sat, Jan 13, 2018 at 11:51 AM, HariKrishnan CK <ckhar...@gmail.com> wrote:

> Hi Wan, could you please be more specific on the scenarios where it will
> give wrong results? I checked the distinct and intersect operators in many
> use cases I have and could not figure out a failure scenario giving wrong
> results.
>
> Thanks
>
> On Jan 12, 2018 7:36 PM, "Wenchen Fan" <cloud0...@gmail.com> wrote:
>
> Actually Spark 2.1.0 doesn't work for your case, it may give you a wrong
> result...
> We are still working on adding this feature, but before that, we should
> fail earlier instead of returning a wrong result.
>
> On Sat, Jan 13, 2018 at 11:02 AM, ckhari4u <ckhar...@gmail.com> wrote:
>
>> I see SPARK-19893 is backported to Spark 2.1 and 2.0.1 as well. I do not
>> see a clear justification for why SPARK-19893 is important and needed. I
>> have a sample table which works fine with an earlier build of Spark 2.1.0.
>> Now that the latest build has the backport of SPARK-19893, it's failing
>> with the error:
>>
>> Error in query: Cannot have map type columns in DataFrame which calls set
>> operations(intersect, except, etc.), but the type of column metrics is
>> map<string,int>;;
>> Distinct
>>
>> In the old build of Spark 2.1.0, I tried the below:
>>
>> create TABLE map_demo2
>> (
>>   country_id BIGINT,
>>   metrics MAP <STRING, int>
>> );
>>
>> insert into table map_demo2 select 2, map("chaka", 102);
>> insert into table map_demo2 select 3, map("chaka", 102);
>> insert into table map_demo2 select 4, map("mangaa", 103);
>>
>> spark-sql> select distinct metrics from map_demo2;
>> 18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8501 milliseconds to
>> create the Initialization Vector used by CryptoStream
>> 18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8503 milliseconds to
>> create the Initialization Vector used by CryptoStream
>> 18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8497 milliseconds to
>> create the Initialization Vector used by CryptoStream
>> 18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8496 milliseconds to
>> create the Initialization Vector used by CryptoStream
>> [stage progress bars omitted]
>> {"mangaa":103}
>> {"chaka":102}
>> {"chaka":103}
>> Time taken: 15.331 seconds, Fetched 3 row(s)
>>
>> Here the simple distinct query works fine in Spark. Any thoughts why
>> DISTINCT/EXCEPT/INTERSECT operators are not supported on Map data types?
>> From the PR, it says:
>>
>>   // TODO: although map type is not orderable, technically map type should
>>   // be able to be used in equality comparison, remove this type check once
>>   // we support it.
>>
>> I could not figure out what issue is caused by using the aforementioned
>> operators.
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
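To make the "orderable vs. equality" distinction in that TODO concrete, a tiny plain-Scala illustration (this is ordinary Scala collections, not Spark internals):

    val m1 = Map(1 -> "a", 2 -> "b")
    val m2 = Map(2 -> "b", 1 -> "a")

    // Equality is well defined and ignores the order in which entries were added:
    println(m1 == m2)   // true

    // But there is no natural ordering for maps, so they cannot be sorted:
    // implicitly[Ordering[Map[Int, String]]]   // does not compile

As I understand the thread, the set operators need to compare rows for equality, and until equality on the Catalyst map type is properly supported those comparisons can depend on entry order, which is why map columns are rejected for now.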