Thanks for the feedback. Just filed
https://issues.apache.org/jira/browse/SPARK-10803 to track this issue.
Cheng
On 9/24/15 4:25 AM, Borisa Zivkovic wrote:
Hi,
your suggestion works nicely. I was able to attach metadata to
columns and read that metadata back, both from Spark and by using
ParquetFileReader. It would be nice if we had a way to manipulate
Parquet metadata directly from DataFrames, though.
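For reference, here is roughly what the parquet-mr read path looks
like (a minimal sketch; the part-file path is just a placeholder):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader

// Read a single part-file's footer and pull out its key/value metadata.
val footer = ParquetFileReader.readFooter(
  new Configuration(), new Path("/tmp/parquet/meta/part-r-00000.gz.parquet"))
val keyValueMeta = footer.getFileMetaData.getKeyValueMetaData
// Spark SQL stores its schema (including column metadata) under this key:
val schemaJson = keyValueMeta.get("org.apache.spark.sql.parquet.row.metadata")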
regards
On Wed, 23 Sep 2015 at 09:25 Borisa Zivkovic
<borisha.zivko...@gmail.com> wrote:
Hi,
thanks a lot for this! I will try it out to see if this works ok.
I am planning to use "stable" metadata - so the values will be the
same across all Parquet files inside the directory hierarchy...
On Tue, 22 Sep 2015 at 18:54 Cheng Lian <lian.cs....@gmail.com> wrote:
Michael reminded me that although we don't support direct
manipulation of Parquet metadata, you can still save/query
metadata to/from Parquet via DataFrame per-column metadata.
For example:
import sqlContext.implicits._
import org.apache.spark.sql.types.MetadataBuilder

val path = "file:///tmp/parquet/meta"

// Saving metadata
val meta = new MetadataBuilder().putString("appVersion", "1.0.2").build()
sqlContext.range(10).select($"id".as("id", meta)).coalesce(1).write.mode("overwrite").parquet(path)

// Querying metadata
sqlContext.read.parquet(path).schema("id").metadata.getString("appVersion")
The metadata is saved together with the Spark SQL schema as a JSON
string. For example, the code above generates the following
Parquet metadata (inspected with parquet-meta):
file:    file:/private/tmp/parquet/meta/part-r-00000-77cb2237-e6a8-4cb6-a452-ae205ba7b660.gz.parquet
creator: parquet-mr version 1.6.0
extra:   org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":true,"metadata":{"appVersion":"1.0.2"}}]}
Cheng
On 9/22/15 9:37 AM, Cheng Lian wrote:
I see, this makes sense. We should probably add this to Spark
SQL.
However, there's one corner case to note about user-defined
Parquet metadata. When committing a write job,
ParquetOutputCommitter writes Parquet summary files
(_metadata and _common_metadata), and the user-defined key/value
metadata written in all Parquet part-files gets merged there.
The problem is that, if a single key is associated with
multiple values, Parquet doesn't know how to reconcile the
conflict and simply gives up writing summary files. This
can be particularly annoying when appending. In general, users
should avoid storing "unstable" values like timestamps in
Parquet metadata.
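For instance, something like the following (a hypothetical sketch;
the path and the "writtenAt" key are made up) would leave a single
key with two different values across part-files, so the committer
would skip the summary files:

import sqlContext.implicits._
import org.apache.spark.sql.types.MetadataBuilder

val path = "file:///tmp/parquet/unstable"

// Each append stamps a different "writtenAt" value, so the schema JSON (and
// hence the merged key/value metadata) conflicts across part-files and
// ParquetOutputCommitter gives up on writing _metadata/_common_metadata.
val meta1 = new MetadataBuilder().putString("writtenAt", "2015-09-22T01:00").build()
sqlContext.range(10).select($"id".as("id", meta1)).write.mode("append").parquet(path)

val meta2 = new MetadataBuilder().putString("writtenAt", "2015-09-22T02:00").build()
sqlContext.range(10).select($"id".as("id", meta2)).write.mode("append").parquet(path)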
Cheng
On 9/22/15 1:58 AM, Borisa Zivkovic wrote:
thanks for the answer.
I need this in order to be able to track schema metadata.
Basically, when I create Parquet files from Spark I want to
be able to "tag" them in some way (giving the schema an
appropriate name or attaching some key/values), so that it
is fairly easy to get basic metadata about those Parquet files
when processing and discovering them later on.
On Mon, 21 Sep 2015 at 18:17 Cheng Lian
<lian.cs....@gmail.com> wrote:
Currently Spark SQL doesn't support customizing the schema
name and metadata. May I know why these two matter in your use
case? Some Parquet data models, like parquet-avro, do support
this, while some others don't (e.g. parquet-hive).
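For example, parquet-avro derives the Parquet message name from the
Avro record name (a minimal sketch, assuming parquet-avro and avro
are on the classpath):

import org.apache.avro.SchemaBuilder
import org.apache.parquet.avro.AvroSchemaConverter

// The Avro record name becomes the Parquet message name, instead of
// the hardcoded "root" that Spark SQL currently writes.
val avroSchema = SchemaBuilder.record("MyRecord").fields().requiredLong("id").endRecord()
val parquetSchema = new AvroSchemaConverter().convert(avroSchema)
println(parquetSchema.getName)  // "MyRecord"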
Cheng
On 9/21/15 7:13 AM, Borisa Zivkovic wrote:
> Hi,
>
> I am trying to figure out how to write Parquet metadata when
> persisting DataFrames to Parquet using Spark (1.4.1).
>
> I could not find a way to change the schema name (which seems to be
> hardcoded to "root"), nor a way to add data to the key/value metadata
> in the Parquet footer.
>
> org.apache.parquet.hadoop.metadata.FileMetaData#getKeyValueMetaData
>
> org.apache.parquet.schema.Type#getName
>
> thanks