All,
I'm seeing an issue when I start the Thrift server (for JDBC access) with
the following command:
/spark/spark-1.4.1/sbin/start-thriftserver.sh --master spark://master:7077 --hiveconf "spark.cores.max=2"
After about 40 seconds the Thrift server is started and available on the
default port 10000.
I then submit my application, and the application throws the following
error:
Caused by: java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@6a552721, see the next exception for details.
    at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source)
    ... 86 more
Caused by: java.sql.SQLException: Another instance of Derby may have already booted the database /spark/spark-1.4.1/metastore_db.
    at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source)
    at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
    ... 83 more
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /spark/spark-1.4.1/metastore_db.
This also happens if I do the opposite (submit the application first, and
then start the thrift server).
It looks similar to the following issue -- but not quite the same:
https://issues.apache.org/jira/browse/SPARK-9776
It seems like this set of steps works fine if the metadata database is not
yet created - but once it's created this happens every time. Is this a
known issue? Is there a workaround?
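For anyone trying to reproduce this: as far as I understand, the default
embedded Derby metastore can only be booted by one JVM at a time, and Derby
marks a booted database with a db.lck file inside the database directory.
The sketch below is only a diagnostic under that assumption. The object name
MetastoreLockCheck is made up for illustration, and the default path is the
one from the trace above. A stale db.lck can survive an unclean shutdown, so
a positive result is a hint rather than proof.

```scala
import java.nio.file.{Files, Path, Paths}

// Hypothetical helper, not part of Spark or Derby.
object MetastoreLockCheck {
  // Derby writes db.lck into the database directory while a JVM has it booted.
  // The file may be left behind after a crash, so this is a hint only.
  def isLikelyBooted(metastoreDir: Path): Boolean =
    Files.exists(metastoreDir.resolve("db.lck"))

  def main(args: Array[String]): Unit = {
    // Default path taken from the stack trace; override with the first argument.
    val dir = Paths.get(args.headOption.getOrElse("/spark/spark-1.4.1/metastore_db"))
    if (isLikelyBooted(dir))
      println(s"$dir appears to be held by another JVM (db.lck present)")
    else
      println(s"no db.lck found in $dir")
  }
}
```

Running this before starting the second process (Thrift server or
application) should show whether the first one still holds the metastore.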
Regards,
Bryan Jeffrey
On Wed, Oct 28, 2015 at 3:13 PM, Bryan Jeffrey <[email protected]>
wrote:
> Susan,
>
> I did give that a shot -- I'm seeing a number of oddities:
>
> (1) 'partitionBy' appears to accept only lowercase alphanumeric field
> names. It will work for 'machinename', but not 'machineName' or
> 'machine_name'.
> (2) When partitioning with maps included in the data I get odd string
> conversion issues.
> (3) When partitioning without maps I see frequent out-of-memory issues.
>
> I'll update this email when I've got a more concrete example of problems.
>
> Regards,
>
> Bryan Jeffrey
>
>
>
> On Wed, Oct 28, 2015 at 1:33 PM, Susan Zhang <[email protected]> wrote:
>
>> Have you tried partitionBy?
>>
>> Something like
>>
>> hiveWindowsEvents.foreachRDD( rdd => {
>>   val eventsDataFrame = rdd.toDF()
>>   eventsDataFrame.write.mode(SaveMode.Append)
>>     .partitionBy("windows_event_time_bin")
>>     .saveAsTable("windows_event")
>> })
>>
>>
>>
>> On Wed, Oct 28, 2015 at 7:41 AM, Bryan Jeffrey <[email protected]>
>> wrote:
>>
>>> Hello.
>>>
>>> I am working to get a simple solution working using Spark SQL. I am
>>> writing streaming data to persistent tables using a HiveContext. Writing
>>> to a persistent non-partitioned table works well - I update the table using
>>> Spark streaming, and the output is available via Hive Thrift/JDBC.
>>>
>>> I create a table that looks like the following:
>>>
>>> 0: jdbc:hive2://localhost:10000> describe windows_event;
>>> describe windows_event;
>>> +--------------------------+---------------------+----------+
>>> | col_name                 | data_type           | comment  |
>>> +--------------------------+---------------------+----------+
>>> | target_entity            | string              | NULL     |
>>> | target_entity_type       | string              | NULL     |
>>> | date_time_utc            | timestamp           | NULL     |
>>> | machine_ip               | string              | NULL     |
>>> | event_id                 | string              | NULL     |
>>> | event_data               | map<string,string>  | NULL     |
>>> | description              | string              | NULL     |
>>> | event_record_id          | string              | NULL     |
>>> | level                    | string              | NULL     |
>>> | machine_name             | string              | NULL     |
>>> | sequence_number          | string              | NULL     |
>>> | source                   | string              | NULL     |
>>> | source_machine_name      | string              | NULL     |
>>> | task_category            | string              | NULL     |
>>> | user                     | string              | NULL     |
>>> | additional_data          | map<string,string>  | NULL     |
>>> | windows_event_time_bin   | timestamp           | NULL     |
>>> | # Partition Information  |                     |          |
>>> | # col_name               | data_type           | comment  |
>>> | windows_event_time_bin   | timestamp           | NULL     |
>>> +--------------------------+---------------------+----------+
>>>
>>>
>>> However, when I create a partitioned table and write data using the
>>> following:
>>>
>>> hiveWindowsEvents.foreachRDD( rdd => {
>>>   val eventsDataFrame = rdd.toDF()
>>>   eventsDataFrame.write.mode(SaveMode.Append).saveAsTable("windows_event")
>>> })
>>>
>>> The data is written as though the table is not partitioned, so
>>> everything is written to
>>> /user/hive/warehouse/windows_event/file.gz.parquet. Because the data does
>>> not follow the partition layout, it is neither partitioned nor accessible.
>>>
>>> Is there a straightforward way to write to partitioned tables using
>>> Spark SQL? I understand that the read performance for partitioned data is
>>> far better - are there other performance improvements that might be better
>>> to use instead of partitioning?
>>>
>>> Regards,
>>>
>>> Bryan Jeffrey
>>>
>>
>>
>