[jira] [Commented] (ARROW-11120) [Python][R] Prove out plumbing to pass data between Python and R using rpy2

2021-01-04 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258106#comment-17258106
 ] 

Joris Van den Bossche commented on ARROW-11120:
---

[~lgautier] Your summary workflow (and linked repo) doesn't seem to be using 
the C interface, but rather tries to pass a pointer to an actual Arrow C++ 
array directly?

Just to be sure, did you see {{_export_to_c}} (e.g. at 
https://github.com/apache/arrow/blob/master/python/pyarrow/array.pxi#L1236-L1289)?
It gives you the raw pointers from the C interface, and the R arrow package 
should be able to create an R object from those, AFAIK. 
There are also some tests using CFFI in Python to interact with the C 
interface: 
https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_cffi.py

(note: I don't know the internals of rpy2, so I am not sure whether any of 
those pointers is relevant here)
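
For illustration, a minimal sketch of that combination (pyarrow's cffi 
helpers together with {{_export_to_c}}/{{_import_from_c}}; untested here):

{code:python}
import pyarrow as pa
from pyarrow.cffi import ffi

# allocate the C data interface structs via cffi
c_schema = ffi.new("struct ArrowSchema*")
c_array = ffi.new("struct ArrowArray*")
schema_ptr = int(ffi.cast("uintptr_t", c_schema))
array_ptr = int(ffi.cast("uintptr_t", c_array))

arr = pa.array([1, 2, 3])
arr._export_to_c(array_ptr, schema_ptr)

# the two raw pointers can now be handed to another Arrow implementation
# (e.g. the R package); re-importing in Python as a round-trip check:
arr2 = pa.Array._import_from_c(array_ptr, schema_ptr)
assert arr2.equals(arr)
{code}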

> [Python][R] Prove out plumbing to pass data between Python and R using rpy2
> ---
>
> Key: ARROW-11120
> URL: https://issues.apache.org/jira/browse/ARROW-11120
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python, R
>Reporter: Wes McKinney
>Priority: Major
>
> Per discussion on the mailing list, we should see what is required (if 
> anything) to be able to pass data structures using the C interface between 
> Python and R from the perspective of the Python user using rpy2. rpy2 is sort 
> of the Python version of reticulate. Unit tests will then validate that it's 
> working.





[jira] [Updated] (ARROW-10998) [C++] Filesystems: detect if URI is passed where a file path is required and raise informative error

2021-01-04 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-10998:
--
Fix Version/s: 3.0.0

> [C++] Filesystems: detect if URI is passed where a file path is required and 
> raise informative error
> 
>
> Key: ARROW-10998
> URL: https://issues.apache.org/jira/browse/ARROW-10998
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: filesystem
> Fix For: 3.0.0
>
>
> Currently, when passing a URI to a filesystem method (except for 
> {{from_uri}}) or other functions that accept a filesystem object, you can get 
> a rather cryptic error message (e.g. about "No response body" for S3, in the 
> example below). 
> Ideally, the filesystem object knows its own prefix "scheme", and so can 
> detect if a user is passing a URI instead of a file path, and we can provide a 
> nicer error message (see the sketch after the examples).
> Example with S3:
> {code:python}
> >>> from pyarrow.fs import S3FileSystem
> >>> fs = S3FileSystem(region="us-east-2")
> >>> fs.get_file_info('s3://ursa-labs-taxi-data/2016/01/')
> ...
> OSError: When getting information for key '/ursa-labs-taxi-data/2016/01' in 
> bucket 's3:': AWS Error [code 100]: No response body.
> >>> import pyarrow.parquet as pq
> >>> table = pq.read_table('s3://ursa-labs-taxi-data/2016/01/data.parquet', 
> >>> filesystem=fs)
> ...
> OSError: When getting information for key 
> '/ursa-labs-taxi-data/2016/01/data.parquet' in bucket 's3:': AWS Error [code 
> 100]: No response body.
> {code}
> With a local filesystem, you actually get a not found file:
> {code:python}
> >>> fs = LocalFileSystem()
> >>> fs.get_file_info("file:///home")
> <FileInfo for 'file:///home': type=FileType.NotFound>
> {code}
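> A hypothetical sketch of the proposed check (the helper name is invented; a 
> real implementation would live on the filesystem object itself):
> {code:python}
> def _check_not_uri(path):
>     # a "://" in the path strongly suggests a URI was passed
>     # where a bare file path was expected
>     if "://" in path:
>         raise ValueError(
>             f"Expected a file path, but got a URI: {path!r}. "
>             "Either strip the scheme or use FileSystem.from_uri() instead.")
> {code}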
> cc [~apitrou]





[jira] [Updated] (ARROW-11024) [C++] Writing List to parquet sometimes writes wrong data

2021-01-04 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-11024:
--
Summary: [C++] Writing List to parquet sometimes writes wrong data  
(was: Writing List to parquet sometimes writes wrong data)

> [C++] Writing List to parquet sometimes writes wrong data
> -
>
> Key: ARROW-11024
> URL: https://issues.apache.org/jira/browse/ARROW-11024
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: macOS Catalina, Python 3.7.3, Pyarrow 2.0.0
>Reporter: George Deamont
>Priority: Major
>
>  Sometimes when writing tables that contain List columns, the data is 
> written incorrectly. Here is a code sample that produces the error. There are 
> no exceptions raised here, but a simple equality check via equals() yields 
> False for the second test case... 
>  
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> # Write small amount of data to parquet file, and read it back. In this case, 
> both tables are equal.
> data1 = [[{'x':'abc','y':'abc'}]]*100 + [[{'x':'abc','y':'gcb'}]]*100
> array1 = pa.array(data1)
> table1 = pa.table([array1],names=['column'])
> pq.write_table(table1,'temp1.parquet')
> table1_1 = pq.read_table('temp1.parquet')
> print(table1_1.equals(table1))
> # Write larger amount of data to parquet file, and read it back. In this 
> case, the tables are not equal.
> data2 = data1*100
> array2 = pa.array(data2)
> table2 = pa.table([array2],names=['column'])
> pq.write_table(table2,'temp2.parquet')
> table2_1 = pq.read_table('temp2.parquet')
> print(table2_1.equals(table2))
> {code}
>  
>  





[jira] [Updated] (ARROW-11024) [C++][Parquet] Writing List to parquet sometimes writes wrong data

2021-01-04 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-11024:
--
Summary: [C++][Parquet] Writing List to parquet sometimes writes 
wrong data  (was: [C++] Writing List to parquet sometimes writes wrong 
data)

> [C++][Parquet] Writing List to parquet sometimes writes wrong data
> --
>
> Key: ARROW-11024
> URL: https://issues.apache.org/jira/browse/ARROW-11024
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: macOS Catalina, Python 3.7.3, Pyarrow 2.0.0
>Reporter: George Deamont
>Priority: Major
>
>  Sometimes when writing tables that contain List columns, the data is 
> written incorrectly. Here is a code sample that produces the error. There are 
> no exceptions raised here, but a simple equality check via equals() yields 
> False for the second test case... 
>  
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> # Write small amount of data to parquet file, and read it back. In this case, 
> both tables are equal.
> data1 = [[{'x':'abc','y':'abc'}]]*100 + [[{'x':'abc','y':'gcb'}]]*100
> array1 = pa.array(data1)
> table1 = pa.table([array1],names=['column'])
> pq.write_table(table1,'temp1.parquet')
> table1_1 = pq.read_table('temp1.parquet')
> print(table1_1.equals(table1))
> # Write larger amount of data to parquet file, and read it back. In this 
> case, the tables are not equal.
> data2 = data1*100
> array2 = pa.array(data2)
> table2 = pa.table([array2],names=['column'])
> pq.write_table(table2,'temp2.parquet')
> table2_1 = pq.read_table('temp2.parquet')
> print(table2_1.equals(table2))
> {code}
>  
>  





[jira] [Commented] (ARROW-11024) [C++][Parquet] Writing List to parquet sometimes writes wrong data

2021-01-04 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258124#comment-17258124
 ] 

Joris Van den Bossche commented on ARROW-11024:
---

[~georgedeamont] thanks for the report! I can confirm that this doesn't 
roundtrip correctly on pyarrow 2.0. But it seems to be fixed already on master 
(and we plan to release 3.0 in around 2 weeks).  

I suppose it was probably fixed by ARROW-10493.

> [C++][Parquet] Writing List to parquet sometimes writes wrong data
> --
>
> Key: ARROW-11024
> URL: https://issues.apache.org/jira/browse/ARROW-11024
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: macOS Catalina, Python 3.7.3, Pyarrow 2.0.0
>Reporter: George Deamont
>Priority: Major
>
>  Sometimes when writing tables that contain List columns, the data is 
> written incorrectly. Here is a code sample that produces the error. There are 
> no exceptions raised here, but a simple equality check via equals() yields 
> False for the second test case... 
>  
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> # Write small amount of data to parquet file, and read it back. In this case, 
> both tables are equal.
> data1 = [[{'x':'abc','y':'abc'}]]*100 + [[{'x':'abc','y':'gcb'}]]*100
> array1 = pa.array(data1)
> table1 = pa.table([array1],names=['column'])
> pq.write_table(table1,'temp1.parquet')
> table1_1 = pq.read_table('temp1.parquet')
> print(table1_1.equals(table1))
> # Write larger amount of data to parquet file, and read it back. In this 
> case, the tables are not equal.
> data2 = data1*100
> array2 = pa.array(data2)
> table2 = pa.table([array2],names=['column'])
> pq.write_table(table2,'temp2.parquet')
> table2_1 = pq.read_table('temp2.parquet')
> print(table2_1.equals(table2))
> {code}
>  
>  





[jira] [Commented] (ARROW-11024) [C++][Parquet] Writing List to parquet sometimes writes wrong data

2021-01-04 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258129#comment-17258129
 ] 

Joris Van den Bossche commented on ARROW-11024:
---

Yes, it can also be reproduced using pyarrow 2.0 with the smaller dataset 
when limiting the size of the row groups:

{code}
In [47]: import pyarrow as pa
    ...: import pyarrow.parquet as pq
    ...: 
    ...: # Write small amount of data to parquet file, and read it back.
    ...: data1 = [[{'x':'abc','y':'abc'}]]*100 + [[{'x':'abc','y':'gcb'}]]*100
    ...: array1 = pa.array(data1)
    ...: table1 = pa.table([array1], names=['column'])
    ...: pq.write_table(table1, 'temp1.parquet', row_group_size=20)
    ...: table1_1 = pq.read_table('temp1.parquet')
    ...: print(table1_1.equals(table1))
False
{code}

So this was fixed by ARROW-10493. Adding an additional Python test can't 
hurt; I will quickly do that (a sketch below).
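
Something along these lines (a sketch; the actual test may end up different):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

def test_list_of_struct_roundtrip(tmp_path):
    data = ([[{'x': 'abc', 'y': 'abc'}]] * 100
            + [[{'x': 'abc', 'y': 'gcb'}]] * 100)
    table = pa.table([pa.array(data)], names=['column'])
    path = tmp_path / 'test.parquet'
    # small row groups are what triggered the bug on pyarrow 2.0
    pq.write_table(table, path, row_group_size=20)
    assert pq.read_table(path).equals(table)
{code}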

> [C++][Parquet] Writing List to parquet sometimes writes wrong data
> --
>
> Key: ARROW-11024
> URL: https://issues.apache.org/jira/browse/ARROW-11024
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: macOS Catalina, Python 3.7.3, Pyarrow 2.0.0
>Reporter: George Deamont
>Priority: Major
>
>  Sometimes when writing tables that contain List columns, the data is 
> written incorrectly. Here is a code sample that produces the error. There are 
> no exceptions raised here, but a simple equality check via equals() yields 
> False for the second test case... 
>  
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> # Write small amount of data to parquet file, and read it back. In this case, 
> both tables are equal.
> data1 = [[{'x':'abc','y':'abc'}]]*100 + [[{'x':'abc','y':'gcb'}]]*100
> array1 = pa.array(data1)
> table1 = pa.table([array1],names=['column'])
> pq.write_table(table1,'temp1.parquet')
> table1_1 = pq.read_table('temp1.parquet')
> print(table1_1.equals(table1))
> # Write larger amount of data to parquet file, and read it back. In this 
> case, the tables are not equal.
> data2 = data1*100
> array2 = pa.array(data2)
> table2 = pa.table([array2],names=['column'])
> pq.write_table(table2,'temp2.parquet')
> table2_1 = pq.read_table('temp2.parquet')
> print(table2_1.equals(table2))
> {code}
>  
>  





[jira] [Updated] (ARROW-11024) [C++][Parquet] Writing List to parquet sometimes writes wrong data

2021-01-04 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-11024:
--
Fix Version/s: 3.0.0

> [C++][Parquet] Writing List to parquet sometimes writes wrong data
> --
>
> Key: ARROW-11024
> URL: https://issues.apache.org/jira/browse/ARROW-11024
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: macOS Catalina, Python 3.7.3, Pyarrow 2.0.0
>Reporter: George Deamont
>Assignee: Joris Van den Bossche
>Priority: Major
> Fix For: 3.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
>  Sometimes when writing tables that contain List columns, the data is 
> written incorrectly. Here is a code sample that produces the error. There are 
> no exceptions raised here, but a simple equality check via equals() yields 
> False for the second test case... 
>  
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> # Write small amount of data to parquet file, and read it back. In this case, 
> both tables are equal.
> data1 = [[{'x':'abc','y':'abc'}]]*100 + [[{'x':'abc','y':'gcb'}]]*100
> array1 = pa.array(data1)
> table1 = pa.table([array1],names=['column'])
> pq.write_table(table1,'temp1.parquet')
> table1_1 = pq.read_table('temp1.parquet')
> print(table1_1.equals(table1))
> # Write larger amount of data to parquet file, and read it back. In this 
> case, the tables are not equal.
> data2 = data1*100
> array2 = pa.array(data2)
> table2 = pa.table([array2],names=['column'])
> pq.write_table(table2,'temp2.parquet')
> table2_1 = pq.read_table('temp2.parquet')
> print(table2_1.equals(table2))
> {code}
>  
>  





[jira] [Assigned] (ARROW-11024) [C++][Parquet] Writing List to parquet sometimes writes wrong data

2021-01-04 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-11024:
-

Assignee: Joris Van den Bossche

> [C++][Parquet] Writing List to parquet sometimes writes wrong data
> --
>
> Key: ARROW-11024
> URL: https://issues.apache.org/jira/browse/ARROW-11024
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: macOS Catalina, Python 3.7.3, Pyarrow 2.0.0
>Reporter: George Deamont
>Assignee: Joris Van den Bossche
>Priority: Major
>
>  Sometimes when writing tables that contain List columns, the data is 
> written incorrectly. Here is a code sample that produces the error. There are 
> no exceptions raised here, but a simple equality check via equals() yields 
> False for the second test case... 
>  
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> # Write small amount of data to parquet file, and read it back. In this case, 
> both tables are equal.
> data1 = [[{'x':'abc','y':'abc'}]]*100 + [[{'x':'abc','y':'gcb'}]]*100
> array1 = pa.array(data1)
> table1 = pa.table([array1],names=['column'])
> pq.write_table(table1,'temp1.parquet')
> table1_1 = pq.read_table('temp1.parquet')
> print(table1_1.equals(table1))
> # Write larger amount of data to parquet file, and read it back. In this 
> case, the tables are not equal.
> data2 = data1*100
> array2 = pa.array(data2)
> table2 = pa.table([array2],names=['column'])
> pq.write_table(table2,'temp2.parquet')
> table2_1 = pq.read_table('temp2.parquet')
> print(table2_1.equals(table2))
> {code}
>  
>  





[jira] [Updated] (ARROW-11024) [C++][Parquet] Writing List to parquet sometimes writes wrong data

2021-01-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11024:
---
Labels: pull-request-available  (was: )

> [C++][Parquet] Writing List to parquet sometimes writes wrong data
> --
>
> Key: ARROW-11024
> URL: https://issues.apache.org/jira/browse/ARROW-11024
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: macOS Catalina, Python 3.7.3, Pyarrow 2.0.0
>Reporter: George Deamont
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
>  Sometimes when writing tables that contain List columns, the data is 
> written incorrectly. Here is a code sample that produces the error. There are 
> no exceptions raised here, but a simple equality check via equals() yields 
> False for the second test case... 
>  
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> # Write small amount of data to parquet file, and read it back. In this case, 
> both tables are equal.
> data1 = [[{'x':'abc','y':'abc'}]]*100 + [[{'x':'abc','y':'gcb'}]]*100
> array1 = pa.array(data1)
> table1 = pa.table([array1],names=['column'])
> pq.write_table(table1,'temp1.parquet')
> table1_1 = pq.read_table('temp1.parquet')
> print(table1_1.equals(table1))
> # Write larger amount of data to parquet file, and read it back. In this 
> case, the tables are not equal.
> data2 = data1*100
> array2 = pa.array(data2)
> table2 = pa.table([array2],names=['column'])
> pq.write_table(table2,'temp2.parquet')
> table2_1 = pq.read_table('temp2.parquet')
> print(table2_1.equals(table2))
> {code}
>  
>  





[jira] [Commented] (ARROW-11006) [Python] Array to_numpy slow compared to Numpy.view

2021-01-04 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258162#comment-17258162
 ] 

Joris Van den Bossche commented on ARROW-11006:
---

Did a quick profile of {{[pa_arr.to_numpy(zero_copy_only=True) for _ in 
range(N)]}}, and e.g. around 20% of the time is spent in creating a ChunkedArray 
from the single Array, because all actual conversion functions are written for 
chunked arrays ({{ConvertArrayToPandas}} calls 
{{ConvertChunkedArrayToPandas}}). So I am not sure that is easily avoidable. 

Another part is spent in creating/destroying the {{IntWriter}} class which does 
the actual conversion.

(to be clear, performance improvements are certainly welcome! I am not fully 
sure how much room for improvement there is, but {{to_numpy}} will always be 
slower than the fast numpy-native view)
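
(For reference, a profile along these lines; hypothetical session, not the 
exact one used:)

{code:python}
import cProfile
import pyarrow as pa

pa_arr = pa.array(range(1000))

# the ChunkedArray wrapping and ConvertArrayToPandas show up in the
# cumulative times of repeated to_numpy calls
cProfile.run(
    "[pa_arr.to_numpy(zero_copy_only=True) for _ in range(100_000)]",
    sort="cumtime")
{code}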

> [Python] Array to_numpy slow compared to Numpy.view
> ---
>
> Key: ARROW-11006
> URL: https://issues.apache.org/jira/browse/ARROW-11006
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Paul Balanca
>Assignee: Paul Balanca
>Priority: Minor
>
> The method `to_numpy` is quite slow compared to Numpy slice and viewing 
> performance. For instance:
> {code:python}
> N = 100
> np_arr = np.arange(N)
> pa_arr = pa.array(np_arr)
> %timeit l = [np_arr.view() for _ in range(N)]
> 251 ms ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> %timeit l = [pa_arr.to_numpy(zero_copy_only=True) for _ in range(N)]
> 1.2 s ± 50.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> {code}
> The previous benchmark is clearly an extreme case, but the idea is that for 
> any operation not available in PyArrow, falling back on Numpy is a good 
> option, and the cost of extracting should be as minimal as possible (there are 
> scenarios where you can't easily cache this view, so you end up calling 
> `to_numpy` a fair number of times).
> I would believe that a big part of this overhead is due to PyArrow 
> implementing a very generic Pandas conversion, and using this one even for 
> very simple Numpy-like dense arrays.
> There are a lot of use cases of PyArrow <=> Numpy interaction where I think 
> most projects would be interested in not paying any additional Pandas 
> compatibility cost. And in this particular case, it could be valuable to 
> implement a direct Numpy conversion method for some Array subclasses 
> (starting with the simple `NumericArray`).





[jira] [Commented] (ARROW-11007) [Python] Memory leak in pq.read_table and table.to_pandas

2021-01-04 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258165#comment-17258165
 ] 

Joris Van den Bossche commented on ARROW-11007:
---

Given that this comes up from time to time, it might be useful to document this 
to some extent: expectations around watching memory usage (explaining that the 
deallocated memory might be cached by the memory allocator, etc.), how you can 
actually see how much memory is used ({{total_allocated_bytes}}), and how you 
can check whether a high "apparent" memory usage is indeed related to this and 
not caused by a memory leak (use {{pa.jemalloc_set_decay_ms(0)}}, run your 
function many times in a loop, and see whether memory usage stabilizes or 
keeps constantly growing).
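
A rough sketch of that last check (assuming the default jemalloc allocator; 
the data here is made up):

{code:python}
import io

import pyarrow as pa
import pyarrow.parquet as pq

# ask jemalloc to return freed memory to the OS immediately
pa.jemalloc_set_decay_ms(0)

buf = io.BytesIO()
pq.write_table(pa.table({"x": list(range(100_000))}), buf)

for i in range(100):
    buf.seek(0)
    table = pq.read_table(buf)
    df = table.to_pandas()
    del table, df
    # a stable value means no leak; constant growth would indicate one
    print(pa.total_allocated_bytes())
{code}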

> [Python] Memory leak in pq.read_table and table.to_pandas
> -
>
> Key: ARROW-11007
> URL: https://issues.apache.org/jira/browse/ARROW-11007
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
>Reporter: Michael Peleshenko
>Priority: Major
>
> While upgrading our application to use pyarrow 2.0.0 instead of 0.12.1, we 
> observed a memory leak in the read_table and to_pandas methods. See below for 
> sample code to reproduce it. Memory does not seem to be returned after 
> deleting the table and df as it was in pyarrow 0.12.1.
> *Sample Code*
> {code:python}
> import io
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> from memory_profiler import profile
> @profile
> def read_file(f):
>     table = pq.read_table(f)
>     df = table.to_pandas(strings_to_categorical=True)
>     del table
>     del df
> 
> def main():
>     rows = 200
>     df = pd.DataFrame({
>         "string": ["test"] * rows,
>         "int": [5] * rows,
>         "float": [2.0] * rows,
>     })
>     table = pa.Table.from_pandas(df, preserve_index=False)
>     parquet_stream = io.BytesIO()
>     pq.write_table(table, parquet_stream)
>     for i in range(3):
>         parquet_stream.seek(0)
>         read_file(parquet_stream)
> 
> if __name__ == '__main__':
>     main()
> {code}
> *Python 3.8.5 (conda), pyarrow 2.0.0 (pip), pandas 1.1.2 (pip) Logs*
> {code}
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    161.7 MiB    161.7 MiB           1   @profile
>     10                                         def read_file(f):
>     11    212.1 MiB     50.4 MiB           1       table = pq.read_table(f)
>     12    258.2 MiB     46.1 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    258.2 MiB      0.0 MiB           1       del table
>     14    256.3 MiB     -1.9 MiB           1       del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    256.3 MiB    256.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    279.2 MiB     23.0 MiB           1       table = pq.read_table(f)
>     12    322.2 MiB     43.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    322.2 MiB      0.0 MiB           1       del table
>     14    320.3 MiB     -1.9 MiB           1       del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    320.3 MiB    320.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    326.9 MiB      6.5 MiB           1       table = pq.read_table(f)
>     12    361.7 MiB     34.8 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    361.7 MiB      0.0 MiB           1       del table
>     14    359.8 MiB     -1.9 MiB           1       del df
> {code}
> *Python 3.5.6 (conda), pyarrow 0.12.1 (pip), pandas 0.24.1 (pip) Logs*
> {code}
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    138.4 MiB    138.4 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.2 MiB     47.8 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     33.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    171.7 MiB    -47.5 MiB           1       del table
>     14    139.3 MiB    -32.4 MiB           1       del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Co

[jira] [Updated] (ARROW-11057) [Python] Data inconsistency with read and write

2021-01-04 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-11057:
--
Description: 
I have been reading and writing some tables to parquet and I found some 
inconsistencies.
{code:python}
# create a table with some data
a = pa.Table.from_pydict({'x': [1]*100,'y': [2]*100,'z': [3]*100,})
# write it to file
pq.write_table(a, 'test.parquet')
# read the same file
b = pq.read_table('test.parquet')
# a == b is True, that's good
# write table b to file
pq.write_table(b, 'test2.parquet')
# test is different from test2{code}
Basically it is:
 * Create table in memory
 * Write it to file
 * Read it again
 * Write it to a different file

The files are not the same. The second one contains extra information.

The differences are consistent across different compressions (I tried snappy 
and zstd).

Also, reading the second file and writing it again produces the same file.

Is this a bug or expected behavior?

  was:
I have been reading and writing some tables to parquet and I found some 
inconsistencies.
{code:java}
# create a table with some data
a = pa.Table.from_pydict({'x': [1]*100,'y': [2]*100,'z': [3]*100,})
# write it to file
pq.write_table(a, 'test.parquet')
# read the same file
b = pq.write_table('test.parquet')
# a == b is True, that's good
# write table b to file
pq.write_table(b, 'test2.parquet')
# test is different from test2{code}
Basically it is:
 * Create table in memory
 * Write it to file
 * Read it again
 * Write it to a different file

The files are not the same. The second one contains extra information.

The differences are consistent across different compressions (I tried snappy 
and zstd).

Also, reading the second file and and writing it again, produces the same file.

Is this a bug or an expected behavior?

 

 

 


> [Python] Data inconsistency with read and write
> ---
>
> Key: ARROW-11057
> URL: https://issues.apache.org/jira/browse/ARROW-11057
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
>Reporter: David Quijano
>Priority: Major
>
> I have been reading and writing some tables to parquet and I found some 
> inconsistencies.
> {code:python}
> # create a table with some data
> a = pa.Table.from_pydict({'x': [1]*100,'y': [2]*100,'z': [3]*100,})
> # write it to file
> pq.write_table(a, 'test.parquet')
> # read the same file
> b = pq.read_table('test.parquet')
> # a == b is True, that's good
> # write table b to file
> pq.write_table(b, 'test2.parquet')
> # test is different from test2{code}
> Basically it is:
>  * Create table in memory
>  * Write it to file
>  * Read it again
>  * Write it to a different file
> The files are not the same. The second one contains extra information.
> The differences are consistent across different compressions (I tried snappy 
> and zstd).
> Also, reading the second file and writing it again produces the same file.
> Is this a bug or an expected behavior?
>  
>  
>  





[jira] [Commented] (ARROW-11057) [Python] Data inconsistency with read and write

2021-01-04 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258168#comment-17258168
 ] 

Joris Van den Bossche commented on ARROW-11057:
---

You can indeed also see in pyarrow that the only difference between {{a}} and 
{{b}} is the metadata of the schema:

{code}
In [20]: a.equals(b)
Out[20]: True

In [21]: a.equals(b, check_metadata=True)
Out[21]: False

In [22]: a.schema
Out[22]: 
x: int64
y: int64
z: int64

In [23]: b.schema
Out[23]: 
x: int64
  -- field metadata --
  PARQUET:field_id: '1'
y: int64
  -- field metadata --
  PARQUET:field_id: '2'
z: int64
  -- field metadata --
  PARQUET:field_id: '3'
{code}
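
(And if the goal is a byte-identical rewrite, stripping that field metadata 
again before writing should get you there; an untested sketch, continuing the 
session above:)

{code:python}
import pyarrow as pa

# rebuild the schema without the PARQUET:field_id field metadata
clean = b.cast(pa.schema([f.remove_metadata() for f in b.schema]))
clean.equals(a, check_metadata=True)  # expected to be True now
{code}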

> [Python] Data inconsistency with read and write
> ---
>
> Key: ARROW-11057
> URL: https://issues.apache.org/jira/browse/ARROW-11057
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
>Reporter: David Quijano
>Priority: Major
>
> I have been reading and writing some tables to parquet and I found some 
> inconsistencies.
> {code:python}
> # create a table with some data
> a = pa.Table.from_pydict({'x': [1]*100,'y': [2]*100,'z': [3]*100,})
> # write it to file
> pq.write_table(a, 'test.parquet')
> # read the same file
> b = pq.read_table('test.parquet')
> # a == b is True, that's good
> # write table b to file
> pq.write_table(b, 'test2.parquet')
> # test is different from test2{code}
> Basically it is:
>  * Create table in memory
>  * Write it to file
>  * Read it again
>  * Write it to a different file
> The files are not the same. The second one contains extra information.
> The differences are consistent across different compressions (I tried snappy 
> and zstd).
> Also, reading the second file and writing it again produces the same file.
> Is this a bug or an expected behavior?
>  
>  
>  





[jira] [Commented] (ARROW-11057) [Python] Data inconsistency with read and write

2021-01-04 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258169#comment-17258169
 ] 

Joris Van den Bossche commented on ARROW-11057:
---

So I think we can close this as "expected behaviour"?

> [Python] Data inconsistency with read and write
> ---
>
> Key: ARROW-11057
> URL: https://issues.apache.org/jira/browse/ARROW-11057
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
>Reporter: David Quijano
>Priority: Major
>
> I have been reading and writing some tables to parquet and I found some 
> inconsistencies.
> {code:python}
> # create a table with some data
> a = pa.Table.from_pydict({'x': [1]*100,'y': [2]*100,'z': [3]*100,})
> # write it to file
> pq.write_table(a, 'test.parquet')
> # read the same file
> b = pq.read_table('test.parquet')
> # a == b is True, that's good
> # write table b to file
> pq.write_table(b, 'test2.parquet')
> # test is different from test2{code}
> Basically it is:
>  * Create table in memory
>  * Write it to file
>  * Read it again
>  * Write it to a different file
> The files are not the same. The second one contains extra information.
> The differences are consistent across different compressions (I tried snappy 
> and zstd).
> Also, reading the second file and writing it again produces the same file.
> Is this a bug or an expected behavior?
>  
>  
>  





[jira] [Updated] (ARROW-11069) Parquet writer incorrect data being written when data type is dictionary

2021-01-04 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-11069:
--
Description: 
When writing a dict column using pyarrow. 

 
{code:python}
import pandas as pd

orig = pd.read_parquet("original.parquet")
orig.to_parquet("first_write.parquet")

first_write = pd.read_parquet("first_write.parquet")

print(orig.equals(first_write))
{code}


 
 These incorrect results start appearing after index 1024. first_write.parquet 
was created after reading and then writing it again. I don't see any obvious 
pattern in the shuffled rows.

!image-2020-12-30-01-20-45-183.png!
  Original records
 !image-2020-12-30-01-19-20-491.png!

Written records

  was:
When writing a dict column using pyarrow. 

 
{code:python}
import pandas as pd

orig = pd.read_parquet("original.parquet")
df.to_parquet("first_write.parquet")

first_write = pd.read_parquet("first_write.parquet")

print(orig.equals(first_write))
{code}


 
 This incorrect results start appearing after index 1024. first_write.parquet 
was created after reading and then writing it again. I don't see any obvious 
pattern in the shuffled rows.

!image-2020-12-30-01-20-45-183.png!
  Original records
 !image-2020-12-30-01-19-20-491.png!

Written records


> Parquet writer incorrect data being written when data type is dictionary
> 
>
> Key: ARROW-11069
> URL: https://issues.apache.org/jira/browse/ARROW-11069
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: pandas v1.0.4
>Reporter: Palash Goel
>Priority: Major
> Attachments: first_write.parquet, image-2020-12-30-01-19-20-491.png, 
> image-2020-12-30-01-19-42-739.png, image-2020-12-30-01-20-45-183.png, 
> original.parquet
>
>
> When writing a dict column using pyarrow. 
>  
> {code:python}
> import pandas as pd
> orig = pd.read_parquet("original.parquet")
> orig.to_parquet("first_write.parquet")
> first_write = pd.read_parquet("first_write.parquet")
> print(orig.equals(first_write))
> {code}
>  
>  These incorrect results start appearing after index 1024. first_write.parquet 
> was created after reading and then writing it again. I don't see any obvious 
> pattern in the shuffled rows.
> !image-2020-12-30-01-20-45-183.png!
>   Original records
>  !image-2020-12-30-01-19-20-491.png!
> Written records





[jira] [Updated] (ARROW-11069) [C++] Parquet writer incorrect data being written when data type is struct

2021-01-04 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-11069:
--
Summary: [C++] Parquet writer incorrect data being written when data type 
is struct  (was: Parquet writer incorrect data being written when data type is 
dictionary)

> [C++] Parquet writer incorrect data being written when data type is struct
> --
>
> Key: ARROW-11069
> URL: https://issues.apache.org/jira/browse/ARROW-11069
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: pandas v1.0.4
>Reporter: Palash Goel
>Priority: Major
> Attachments: first_write.parquet, image-2020-12-30-01-19-20-491.png, 
> image-2020-12-30-01-19-42-739.png, image-2020-12-30-01-20-45-183.png, 
> original.parquet
>
>
> When writing a dict column using pyarrow. 
>  
> {code:python}
> import pandas as pd
> orig = pd.read_parquet("original.parquet")
> orig.to_parquet("first_write.parquet")
> first_write = pd.read_parquet("first_write.parquet")
> print(orig.equals(first_write))
> {code}
>  
>  These incorrect results start appearing after index 1024. first_write.parquet 
> was created after reading and then writing it again. I don't see any obvious 
> pattern in the shuffled rows.
> !image-2020-12-30-01-20-45-183.png!
>   Original records
>  !image-2020-12-30-01-19-20-491.png!
> Written records





[jira] [Updated] (ARROW-11069) [C++] Parquet writer incorrect data being written when data type is struct

2021-01-04 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-11069:
--
Fix Version/s: 3.0.0

> [C++] Parquet writer incorrect data being written when data type is struct
> --
>
> Key: ARROW-11069
> URL: https://issues.apache.org/jira/browse/ARROW-11069
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: pandas v1.0.4
>Reporter: Palash Goel
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: first_write.parquet, image-2020-12-30-01-19-20-491.png, 
> image-2020-12-30-01-19-42-739.png, image-2020-12-30-01-20-45-183.png, 
> original.parquet
>
>
> When writing a dict column using pyarrow. 
>  
> {code:python}
> import pandas as pd
> orig = pd.read_parquet("original.parquet")
> orig.to_parquet("first_write.parquet")
> first_write = pd.read_parquet("first_write.parquet")
> print(orig.equals(first_write))
> {code}
>  
>  These incorrect results start appearing after index 1024. first_write.parquet 
> was created after reading and then writing it again. I don't see any obvious 
> pattern in the shuffled rows.
> !image-2020-12-30-01-20-45-183.png!
>   Original records
>  !image-2020-12-30-01-19-20-491.png!
> Written records





[jira] [Commented] (ARROW-11069) [C++] Parquet writer incorrect data being written when data type is struct

2021-01-04 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258172#comment-17258172
 ] 

Joris Van den Bossche commented on ARROW-11069:
---

[~palashgoel7] Thanks for the report! This has in the meantime already been 
fixed by ARROW-10493, and this fix will be included in pyarrow 3.0 release (to 
be released somewhere in January)

> [C++] Parquet writer incorrect data being written when data type is struct
> --
>
> Key: ARROW-11069
> URL: https://issues.apache.org/jira/browse/ARROW-11069
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: pandas v1.0.4
>Reporter: Palash Goel
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: first_write.parquet, image-2020-12-30-01-19-20-491.png, 
> image-2020-12-30-01-19-42-739.png, image-2020-12-30-01-20-45-183.png, 
> original.parquet
>
>
> When writing a dict column using pyarrow. 
>  
> {code:python}
> import pandas as pd
> orig = pd.read_parquet("original.parquet")
> orig.to_parquet("first_write.parquet")
> first_write = pd.read_parquet("first_write.parquet")
> print(orig.equals(first_write))
> {code}
>  
>  These incorrect results start appearing after index 1024. first_write.parquet 
> was created after reading and then writing it again. I don't see any obvious 
> pattern in the shuffled rows.
> !image-2020-12-30-01-20-45-183.png!
>   Original records
>  !image-2020-12-30-01-19-20-491.png!
> Written records





[jira] [Closed] (ARROW-11069) [C++] Parquet writer incorrect data being written when data type is struct

2021-01-04 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche closed ARROW-11069.
-
Resolution: Duplicate

> [C++] Parquet writer incorrect data being written when data type is struct
> --
>
> Key: ARROW-11069
> URL: https://issues.apache.org/jira/browse/ARROW-11069
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: pandas v1.0.4
>Reporter: Palash Goel
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: first_write.parquet, image-2020-12-30-01-19-20-491.png, 
> image-2020-12-30-01-19-42-739.png, image-2020-12-30-01-20-45-183.png, 
> original.parquet
>
>
> When writing a dict column using pyarrow. 
>  
> {code:python}
> import pandas as pd
> orig = pd.read_parquet("original.parquet")
> orig.to_parquet("first_write.parquet")
> first_write = pd.read_parquet("first_write.parquet")
> print(orig.equals(first_write))
> {code}
>  
>  These incorrect results start appearing after index 1024. first_write.parquet 
> was created after reading and then writing it again. I don't see any obvious 
> pattern in the shuffled rows.
> !image-2020-12-30-01-20-45-183.png!
>   Original records
>  !image-2020-12-30-01-19-20-491.png!
> Written records





[jira] [Commented] (ARROW-11069) [C++] Parquet writer incorrect data being written when data type is struct

2021-01-04 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258175#comment-17258175
 ] 

Joris Van den Bossche commented on ARROW-11069:
---

I included a simplified version of this case in a test added in 
https://github.com/apache/arrow/pull/9091

> [C++] Parquet writer incorrect data being written when data type is struct
> --
>
> Key: ARROW-11069
> URL: https://issues.apache.org/jira/browse/ARROW-11069
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: pandas v1.0.4
>Reporter: Palash Goel
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: first_write.parquet, image-2020-12-30-01-19-20-491.png, 
> image-2020-12-30-01-19-42-739.png, image-2020-12-30-01-20-45-183.png, 
> original.parquet
>
>
> When writing a dict column using pyarrow. 
>  
> {code:python}
> import pandas as pd
> orig = pd.read_parquet("original.parquet")
> orig.to_parquet("first_write.parquet")
> first_write = pd.read_parquet("first_write.parquet")
> print(orig.equals(first_write))
> {code}
>  
>  These incorrect results start appearing after index 1024. first_write.parquet 
> was created after reading and then writing it again. I don't see any obvious 
> pattern in the shuffled rows.
> !image-2020-12-30-01-20-45-183.png!
>   Original records
>  !image-2020-12-30-01-19-20-491.png!
> Written records





[jira] [Resolved] (ARROW-10930) [Python] LargeListType doesn't have a value_field

2021-01-04 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-10930.
---
Resolution: Fixed

Issue resolved by pull request 8979
[https://github.com/apache/arrow/pull/8979]

> [Python] LargeListType doesn't have a value_field
> -
>
> Key: ARROW-10930
> URL: https://issues.apache.org/jira/browse/ARROW-10930
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
>Reporter: Jim Pivarski
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This one is easy: it looks like the LargeListType is just missing this field. 
> Here it is for a 32-bit list (the reason I want this is to get at the 
> "nullable" field, although the "metadata" would be nice, too):
> {code:python}
> >>> import pyarrow as pa
> >>> small_array = pa.ListArray.from_arrays(pa.array([0, 3, 3, 5]),
> ...                                        pa.array([1.1, 2.2, 3.3, 4.4, 5.5]))
> >>> small_array.type.value_field
> pyarrow.Field<item: double>
> >>> small_array.type.value_field.nullable
> True{code}
> Now with a large list:
> {code:python}
> >>> large_array = pa.LargeListArray.from_arrays(pa.array([0, 3, 3, 5]),
> ...                                             pa.array([1.1, 2.2, 3.3, 4.4, 5.5]))
> >>> large_array.type.value_field
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> AttributeError: 'pyarrow.lib.LargeListType' object has no attribute 
> 'value_field'{code}
> Verifying version:
> {code:python}
> >>> pa.__version__
> '2.0.0'{code}





[jira] [Assigned] (ARROW-11095) [Python] Access pyarrow.RecordBatch column by name

2021-01-04 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-11095:
-

Assignee: Will Jones

> [Python] Access pyarrow.RecordBatch column by name
> --
>
> Key: ARROW-11095
> URL: https://issues.apache.org/jira/browse/ARROW-11095
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I propose adding support for selecting a column out of a pyarrow.RecordBatch 
> using both __getitem__() and .field(), like we have in pyarrow.Table.
> pyarrow.RecordBatch has a pretty similar API to pyarrow.Table (e.g. both have 
> filter and take methods and a schema), but I got tripped up on this 
> difference. pyarrow.Table supports accessing columns by name using both 
> __getitem__ and .field():
> {code:python}
> my_array = pa.array(range(10))
> table = pa.Table.from_arrays([my_array], names=['my_column'])
> # Both of these work on table:
> table['my_column']
> table.field('my_column')
> {code}
> Meanwhile pyarrow.RecordBatch doesn't support either of those. In fact, I had 
> a hard time finding a way to grab a column by name from a recordbatch without 
> first looking up the integer index.
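> A sketch of the current workaround (resolving the integer index by hand):
> {code:python}
> import pyarrow as pa
> 
> my_array = pa.array(range(10))
> batch = pa.RecordBatch.from_arrays([my_array], names=['my_column'])
> 
> batch.column(batch.schema.get_field_index('my_column'))
> {code}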





[jira] [Resolved] (ARROW-11095) [Python] Access pyarrow.RecordBatch column by name

2021-01-04 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-11095.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 9073
[https://github.com/apache/arrow/pull/9073]

> [Python] Access pyarrow.RecordBatch column by name
> --
>
> Key: ARROW-11095
> URL: https://issues.apache.org/jira/browse/ARROW-11095
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I propose adding support for selecting a column out of a pyarrow.RecordBatch 
> using both __getitem__() and .field(), like we have in pyarrow.Table.
> pyarrow.RecordBatch has a pretty similar API to pyarrow.Table (e.g. both have 
> filter and take methods and a schema), but I got tripped up on this 
> difference. pyarrow.Table supports accessing columns by name using both 
> __getitem__ and .field():
> {code:python}
> my_array = pa.array(range(10))
> table = pa.Table.from_arrays([my_array], names=['my_column'])
> # Both of these work on table:
> table['my_column']
> table.field('my_column')
> {code}
> Meanwhile pyarrow.RecordBatch doesn't support either of those. In fact, I had 
> a hard time finding a way to grab a column by name from a recordbatch without 
> first looking up the integer index.





[jira] [Commented] (ARROW-11120) [Python][R] Prove out plumbing to pass data between Python and R using rpy2

2021-01-04 Thread Laurent (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258221#comment-17258221
 ] 

Laurent commented on ARROW-11120:
-

[~jorisvandenbossche] Thanks. I did confuse a few things, and came in with the 
(wrong) assumption that an R "External Pointer", traditionally the way to have 
pointers to external C or C++ structures, would be used.
In fact this {{_export_to_c}} is used, as you point out (here in the R package: 
https://github.com/apache/arrow/blob/master/r/R/python.R#L26).

In the end the workflow is more like (see the sketch after the list):
* load the R package "arrow" in the embedded R
* in R, get pointers ({{uintptr_t}} cast to {{double}}) with 
{{arrow::allocate_arrow_schema()}} and {{arrow::allocate_arrow_array()}}
* in Python, call the method {{_export_to_c}} using the two pointers above 
(casting back from {{double}} to {{uintptr_t}})
* in R, call {{arrow::ImportArray(array_ptr, schema_ptr)}}

If correct this time, this is great news, as it removes the cffi vs. 
C-extension/Cython issue.
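
A rough sketch of that workflow through rpy2 (function names taken from the 
list above; untested):

{code:python}
import pyarrow as pa
import rpy2.robjects as ro

ro.r('suppressMessages(library(arrow))')

# the R helpers hand the pointers back as doubles carrying a uintptr_t
schema_ptr = int(ro.r('schema_ptr <- arrow::allocate_arrow_schema()')[0])
array_ptr = int(ro.r('array_ptr <- arrow::allocate_arrow_array()')[0])

pa.array([1.1, 2.2, 3.3])._export_to_c(array_ptr, schema_ptr)

# rebuild the Array on the R side, then free the C structs
r_array = ro.r('arrow::ImportArray(array_ptr, schema_ptr)')
ro.r('arrow::delete_arrow_schema(schema_ptr); arrow::delete_arrow_array(array_ptr)')
{code}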

> [Python][R] Prove out plumbing to pass data between Python and R using rpy2
> ---
>
> Key: ARROW-11120
> URL: https://issues.apache.org/jira/browse/ARROW-11120
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python, R
>Reporter: Wes McKinney
>Priority: Major
>
> Per discussion on the mailing list, we should see what is required (if 
> anything) to be able to pass data structures using the C interface between 
> Python and R from the perspective of the Python user using rpy2. rpy2 is sort 
> of the Python version of reticulate. Unit tests will then validate that it's 
> working.





[jira] [Updated] (ARROW-10132) [Rust] Considers scientific notation when inferring schema from csv

2021-01-04 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-10132:
---
Summary: [Rust] Considers scientific notation when inferring schema from 
csv  (was: Considers scientific notation when inferring schema from csv)

> [Rust] Considers scientific notation when inferring schema from csv
> ---
>
> Key: ARROW-10132
> URL: https://issues.apache.org/jira/browse/ARROW-10132
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 1.0.1
> Environment: Ubuntu
>Reporter: Ziru Niu
>Priority: Minor
>  Labels: easyfix
>
>  
> ||col||
> |1.2|
> |1.3e-2|
> |1.4|
> Currently this column would be inferred as Utf8 type, since 
> csv::reader::DECIMAL_RE is defined as r"^-?(\d+\.\d+)$". Maybe we could 
> change this to r"^-?(\d+\.\d+)(e-?(\d+))?$" or something similar to allow 
> scientific notation of real numbers to be inferred as float (a demo of the 
> proposed pattern follows below)?
>  
> (The RE I currently proposed doesn't handle "5e-4" correctly though)
>  
> I would also wish we could infer "3." or ".3" as float. I will come up 
> with an exact RE when I get time.
>  
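> For illustration, the behaviour of the proposed pattern (demonstrated with 
> Python's {{re}}; the regex syntax is the same for this pattern):
> {code:python}
> import re
> 
> current = re.compile(r"^-?(\d+\.\d+)$")
> proposed = re.compile(r"^-?(\d+\.\d+)(e-?(\d+))?$")
> 
> for s in ["1.2", "1.3e-2", "5e-4", "3.", ".3"]:
>     print(s, bool(current.match(s)), bool(proposed.match(s)))
> # only "1.3e-2" gains a match; "5e-4", "3." and ".3" still do not
> {code}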





[jira] [Commented] (ARROW-11120) [Python][R] Prove out plumbing to pass data between Python and R using rpy2

2021-01-04 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258261#comment-17258261
 ] 

Joris Van den Bossche commented on ARROW-11120:
---

I am not that familiar with the inner details, but from a high level, that 
indeed sounds correct.

(cc [~apitrou])

> [Python][R] Prove out plumbing to pass data between Python and R using rpy2
> ---
>
> Key: ARROW-11120
> URL: https://issues.apache.org/jira/browse/ARROW-11120
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python, R
>Reporter: Wes McKinney
>Priority: Major
>
> Per discussion on the mailing list, we should see what is required (if 
> anything) to be able to pass data structures using the C interface between 
> Python and R from the perspective of the Python user using rpy2. rpy2 is sort 
> of the Python version of reticulate. Unit tests will then validate that it's 
> working.





[jira] [Commented] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls

2021-01-04 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258263#comment-17258263
 ] 

Antoine Pitrou commented on ARROW-11067:


[~westonpace] Do you want to submit a PR for this or do you prefer me to do it?

> [R] read_csv_arrow silently fails to read some strings and returns nulls
> 
>
> Key: ARROW-11067
> URL: https://issues.apache.org/jira/browse/ARROW-11067
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: John Sheffield
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: arrow_explanation.png, arrow_failure_cases.csv, 
> arrow_failure_cases.csv, arrowbug1.png, arrowbug1.png, demo_data.csv
>
>
> A sample file is attached, showing 10 rows each of strings with consistent 
> failures (false_na = TRUE) and consistent successes (false_na = FALSE). The 
> strings are in the column `json_string` – if relevant, they are geojsons with 
> min nchar of 33,229 and max nchar of 202,515.
> When I read this sample file with other R CSV readers (readr and data.table 
> shown), the files are imported correctly and there are no NAs in the 
> json_string column.
> When I read with arrow::read_csv_arrow, 50% of the sample json_string column 
> end up as NAs. as_data_frame TRUE or FALSE does not change the behavior, so 
> this might not be limited to the R interface, but I can't help debug much 
> further upstream.
>  
>  
> {code}
> aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
> aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
> bbb <- data.table::fread("demo_data.csv")
> ccc <- readr::read_csv("demo_data.csv")
> mean(is.na(aaa1$json_string)) # 0.5
> mean(is.na(aaa2$column(1))) # Scalar 0.5
> mean(is.na(bbb$json_string)) # 0
> mean(is.na(ccc$json_string)) # 0{code}
>  
>  
>  * arrow 2.0 (latest CRAN)
>  * readr 1.4.0
>  * data.table 1.13.2
>  * R version 4.0.1 (2020-06-06)
>  * MacOS Catalina 10.15.7 / x86_64-apple-darwin17.0
>  
>  





[jira] [Commented] (ARROW-11120) [Python][R] Prove out plumbing to pass data between Python and R using rpy2

2021-01-04 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258268#comment-17258268
 ] 

Antoine Pitrou commented on ARROW-11120:


[~lgautier] Yes, this looks correct. Also don't forget the 
{{delete_arrow_schema}} and {{delete_arrow_array}} if you don't want to create 
a memory leak.

> [Python][R] Prove out plumbing to pass data between Python and R using rpy2
> ---
>
> Key: ARROW-11120
> URL: https://issues.apache.org/jira/browse/ARROW-11120
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python, R
>Reporter: Wes McKinney
>Priority: Major
>
> Per discussion on the mailing list, we should see what is required (if 
> anything) to be able to pass data structures using the C interface between 
> Python and R from the perspective of the Python user using rpy2. rpy2 is sort 
> of the Python version of reticulate. Unit tests will then validate that it's 
> working.





[jira] [Commented] (ARROW-5756) [Python] Remove manylinux1 support

2021-01-04 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258274#comment-17258274
 ] 

Joris Van den Bossche commented on ARROW-5756:
--

Some other links:

- Matti Picus wrote a blog post arguing that it is time to drop manylinux1: 
https://labs.quansight.org/blog/2020/11/manylinux1-is-obsolete-manylinux2010-is-almost-eol-what-is-next/
- He raised the issue on the numpy mailing list 
(https://mail.python.org/pipermail/numpy-discussion/2020-November/081196.html) 
and on the Python discussion site 
(https://discuss.python.org/t/blog-post-about-manylinux-and-the-future-of-manylinux1/5734)

Now, to what extent should we tie this to our support for Python 3.6? Python 
3.6 ships with an older pip version (18) that does not yet support 
manylinux2010. Users can of course always (and probably _should_) upgrade 
pip, but if they don't, they can get quite cryptic error messages, as the 
older pip will try to install from source when no manylinux1 wheel is 
available.


> [Python] Remove manylinux1 support
> --
>
> Key: ARROW-5756
> URL: https://issues.apache.org/jira/browse/ARROW-5756
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Packaging, Python
>Reporter: Antoine Pitrou
>Priority: Major
>
> We should decide when we want to stop producing manylinux1 packages. 
> Installing manylinux2010 packages requires a recent pip (and, obviously, a 
> not-too-antiquated system, but I think that's less of a problem for us).





[jira] [Commented] (ARROW-5756) [Python] Remove manylinux1 support

2021-01-04 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258277#comment-17258277
 ] 

Antoine Pitrou commented on ARROW-5756:
---

Kludge, but we could try to catch source build errors in {{setup.py}} and wrap 
them in a nice error message for end users.
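
A hypothetical sketch of that kludge (the command class and message below are 
illustrative, not actual pyarrow code):

{code}
from setuptools import setup
from setuptools.command.build_ext import build_ext


class BuildExtWithHint(build_ext):
    """Wrap the native build step so a failed source build gives a hint."""

    def run(self):
        try:
            super().run()
        except Exception as exc:
            raise RuntimeError(
                "Building pyarrow from source failed. If you expected a "
                "binary wheel instead, your pip may be too old to recognize "
                "manylinux2010 wheels; try: python -m pip install "
                "--upgrade pip"
            ) from exc


# remaining setup() arguments elided
setup(cmdclass={"build_ext": BuildExtWithHint})
{code}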

> [Python] Remove manylinux1 support
> --
>
> Key: ARROW-5756
> URL: https://issues.apache.org/jira/browse/ARROW-5756
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Packaging, Python
>Reporter: Antoine Pitrou
>Priority: Major
>
> We should decide when we want to stop producing manylinux1 packages. 
> Installing manylinux2010 packages requires a recent pip (and, obviously, a 
> not-too-antiquated system, but I think that's less of a problem for us).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11124) [Doc] Update status matrix for Decimal256

2021-01-04 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-11124:
--

 Summary: [Doc] Update status matrix for Decimal256
 Key: ARROW-11124
 URL: https://issues.apache.org/jira/browse/ARROW-11124
 Project: Apache Arrow
  Issue Type: Task
  Components: C++, Documentation
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou
 Fix For: 3.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11124) [Doc] Update status matrix for Decimal256

2021-01-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11124:
---
Labels: pull-request-available  (was: )

> [Doc] Update status matrix for Decimal256
> -
>
> Key: ARROW-11124
> URL: https://issues.apache.org/jira/browse/ARROW-11124
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++, Documentation
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11124) [Doc] Update status matrix for Decimal256

2021-01-04 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-11124.

Resolution: Fixed

Issue resolved by pull request 8925
[https://github.com/apache/arrow/pull/8925]

> [Doc] Update status matrix for Decimal256
> -
>
> Key: ARROW-11124
> URL: https://issues.apache.org/jira/browse/ARROW-11124
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++, Documentation
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 3.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11065) [C++] Installation failed on AIX7.2

2021-01-04 Thread Xiaobo Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258284#comment-17258284
 ] 

Xiaobo Zhang commented on ARROW-11065:
--

According to [https://en.wikipedia.org/wiki/SSE4], SSE4 is only available on 
Intel- and AMD-based CPUs, so we should disable the SSE4 option for the Arrow 
C++ installation on AIX.  Is there an instruction for disabling it?

Thanks.

> [C++] Installation failed on AIX7.2
> ---
>
> Key: ARROW-11065
> URL: https://issues.apache.org/jira/browse/ARROW-11065
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 2.0.0
> Environment: AIX7.2
>Reporter: Xiaobo Zhang
>Priority: Major
> Attachments: CMakeError.log, CMakeOutput.log
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> My installation of pyarrow on AIX7.2 failed due to a missing Arrow C++ 
> library, and I was told I have to install Arrow C++ first.  I downloaded the 
> Arrow 2.0.0 tarball and tried to install its "cpp" component according to 
> the instructions.  However, I got the following error after {{cd release}} 
> when running {{cmake ..}}:
>  
> {noformat}
> Login=root: Line=602 > cmake ..
> -- Building using CMake version: 3.16.0
> -- Arrow version: 2.0.0 (full: '2.0.0')
> -- Arrow SO version: 200 (full: 200.0.0)
> -- clang-tidy not found
> -- clang-format not found
> -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN)
> -- infer not found
> -- Found cpplint executable at 
> /software/thirdparty/apache-arrow-2.0.0/cpp/build-support/cpplint.py
> -- System processor: powerpc
> -- Arrow build warning level: PRODUCTION
> CMake Error at cmake_modules/SetupCxxFlags.cmake:365 (message):
>   SSE4.2 required but compiler doesn't support it.
> Call Stack (most recent call first):
>   CMakeLists.txt:437 (include)
> -- Configuring incomplete, errors occurred!
> See also 
> "/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeOutput.log".
> See also 
> "/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeError.log".
> {noformat}
> Attached are 2 CMake output/error files.  Sutou Kouhei suggested that I 
> submit an issue here.  Can someone please help me fix the issue?  What do I 
> need to do about the required SSE4.2?
> Thanks.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11121) [Developer] Use pull_request_target for PR JIRA integration

2021-01-04 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-11121.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 9087
[https://github.com/apache/arrow/pull/9087]

> [Developer] Use pull_request_target for PR JIRA integration
> ---
>
> Key: ARROW-11121
> URL: https://issues.apache.org/jira/browse/ARROW-11121
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11065) [C++] Installation failed on AIX7.2

2021-01-04 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258306#comment-17258306
 ] 

Neal Richardson commented on ARROW-11065:
-

Does it work if you set {{-DARROW_SIMD_LEVEL=NONE}}?

cf. 
https://issues.apache.org/jira/browse/ARROW-9923?focusedCommentId=17226396&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17226396

> [C++] Installation failed on AIX7.2
> ---
>
> Key: ARROW-11065
> URL: https://issues.apache.org/jira/browse/ARROW-11065
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 2.0.0
> Environment: AIX7.2
>Reporter: Xiaobo Zhang
>Priority: Major
> Attachments: CMakeError.log, CMakeOutput.log
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> My installation of pyarrow on AIX7.2 failed due to a missing Arrow C++ 
> library, and I was told I have to install Arrow C++ first.  I downloaded the 
> Arrow 2.0.0 tarball and tried to install its "cpp" component according to 
> the instructions.  However, I got the following error after {{cd release}} 
> when running {{cmake ..}}:
>  
> {noformat}
> Login=root: Line=602 > cmake ..
> -- Building using CMake version: 3.16.0
> -- Arrow version: 2.0.0 (full: '2.0.0')
> -- Arrow SO version: 200 (full: 200.0.0)
> -- clang-tidy not found
> -- clang-format not found
> -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN)
> -- infer not found
> -- Found cpplint executable at 
> /software/thirdparty/apache-arrow-2.0.0/cpp/build-support/cpplint.py
> -- System processor: powerpc
> -- Arrow build warning level: PRODUCTION
> CMake Error at cmake_modules/SetupCxxFlags.cmake:365 (message):
>   SSE4.2 required but compiler doesn't support it.
> Call Stack (most recent call first):
>   CMakeLists.txt:437 (include)
> -- Configuring incomplete, errors occurred!
> See also 
> "/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeOutput.log".
> See also 
> "/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeError.log".
> {noformat}
> Attached are 2 CMake output/error files.  Sutou Kouhei suggested that I 
> submit an issue here.  Can someone please help me fix the issue?  What do I 
> need to do about the required SSE4.2?
> Thanks.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10840) [C++] FileMetaData does not have key_value_metadata when built from FileMetaDataBuilder

2021-01-04 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-10840.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 9012
[https://github.com/apache/arrow/pull/9012]

> [C++] FileMetaData does not have key_value_metadata when built from 
> FileMetaDataBuilder
> ---
>
> Key: ARROW-10840
> URL: https://issues.apache.org/jira/browse/ARROW-10840
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 1.0.0, 1.0.1, 2.0.0
>Reporter: Jimmy Lu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> {{FileMetaDataImpl::key_value_metadata_}} is initialized in the constructor: 
> https://github.com/apache/arrow/blob/apache-arrow-1.0.0/cpp/src/parquet/metadata.cc#L530
> But in {{RowGroupMetaDataBuilder::Finish}}, {{FileMetaDataImpl::metadata_}} 
> is set after the construction and we are missing the 
> {{FileMetaDataImpl::InitKeyValueMetadata}} call here: 
> https://github.com/apache/arrow/blob/apache-arrow-1.0.0/cpp/src/parquet/metadata.cc#L1412



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10840) [C++] Parquet FileMetaData does not have key_value_metadata when built from FileMetaDataBuilder

2021-01-04 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-10840:
---
Summary: [C++] Parquet FileMetaData does not have key_value_metadata when 
built from FileMetaDataBuilder  (was: [C++] FileMetaData does not have 
key_value_metadata when built from FileMetaDataBuilder)

> [C++] Parquet FileMetaData does not have key_value_metadata when built from 
> FileMetaDataBuilder
> ---
>
> Key: ARROW-10840
> URL: https://issues.apache.org/jira/browse/ARROW-10840
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 1.0.0, 1.0.1, 2.0.0
>Reporter: Jimmy Lu
>Assignee: Jimmy Lu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> {{FileMetaDataImpl::key_value_metadata_}} is initialized in the constructor: 
> https://github.com/apache/arrow/blob/apache-arrow-1.0.0/cpp/src/parquet/metadata.cc#L530
> But in {{RowGroupMetaDataBuilder::Finish}}, {{FileMetaDataImpl::metadata_}} 
> is set after the construction and we are missing the 
> {{FileMetaDataImpl::InitKeyValueMetadata}} call here: 
> https://github.com/apache/arrow/blob/apache-arrow-1.0.0/cpp/src/parquet/metadata.cc#L1412



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10840) [C++] FileMetaData does not have key_value_metadata when built from FileMetaDataBuilder

2021-01-04 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-10840:
--

Assignee: Jimmy Lu

> [C++] FileMetaData does not have key_value_metadata when built from 
> FileMetaDataBuilder
> ---
>
> Key: ARROW-10840
> URL: https://issues.apache.org/jira/browse/ARROW-10840
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 1.0.0, 1.0.1, 2.0.0
>Reporter: Jimmy Lu
>Assignee: Jimmy Lu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> {{FileMetaDataImpl::key_value_metadata_}} is initialized in the constructor: 
> https://github.com/apache/arrow/blob/apache-arrow-1.0.0/cpp/src/parquet/metadata.cc#L530
> But in {{RowGroupMetaDataBuilder::Finish}}, {{FileMetaDataImpl::metadata_}} 
> is set after the construction and we are missing the 
> {{FileMetaDataImpl::InitKeyValueMetadata}} call here: 
> https://github.com/apache/arrow/blob/apache-arrow-1.0.0/cpp/src/parquet/metadata.cc#L1412



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11065) [C++] Installation failed on AIX7.2

2021-01-04 Thread Xiaobo Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258353#comment-17258353
 ] 

Xiaobo Zhang commented on ARROW-11065:
--

I issued the following "export" statement and checked EXTRA_CMAKE_FLAGS with 
"set" to make sure it is defined.  However, I still get the same error caused 
by SSE4.  See the attached error log.  [^CMakeError.log]

Login=root: Line=661 > history
646 export EXTRA_CMAKE_FLAGS="-DARROW_SIMD_LEVEL=NONE"
Login=root: Line=662 >

 

 

> [C++] Installation failed on AIX7.2
> ---
>
> Key: ARROW-11065
> URL: https://issues.apache.org/jira/browse/ARROW-11065
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 2.0.0
> Environment: AIX7.2
>Reporter: Xiaobo Zhang
>Priority: Major
> Attachments: CMakeError.log, CMakeError.log, CMakeOutput.log
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> My installation of pyarrow on AIX7.2 failed due to a missing Arrow C++ 
> library, and I was told I have to install Arrow C++ first.  I downloaded the 
> Arrow 2.0.0 tarball and tried to install its "cpp" component according to 
> the instructions.  However, I got the following error after {{cd release}} 
> when running {{cmake ..}}:
>  
> {noformat}
> Login=root: Line=602 > cmake ..
> -- Building using CMake version: 3.16.0
> -- Arrow version: 2.0.0 (full: '2.0.0')
> -- Arrow SO version: 200 (full: 200.0.0)
> -- clang-tidy not found
> -- clang-format not found
> -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN)
> -- infer not found
> -- Found cpplint executable at 
> /software/thirdparty/apache-arrow-2.0.0/cpp/build-support/cpplint.py
> -- System processor: powerpc
> -- Arrow build warning level: PRODUCTION
> CMake Error at cmake_modules/SetupCxxFlags.cmake:365 (message):
>   SSE4.2 required but compiler doesn't support it.
> Call Stack (most recent call first):
>   CMakeLists.txt:437 (include)
> -- Configuring incomplete, errors occurred!
> See also 
> "/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeOutput.log".
> See also 
> "/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeError.log".
> {noformat}
> Attached are 2 CMake output/error files.  Sutou Kouhei suggested that I 
> submit an issue here.  Can someone please help me fix the issue?  What do I 
> need to do about the required SSE4.2?
> Thanks.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11065) [C++] Installation failed on AIX7.2

2021-01-04 Thread Xiaobo Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaobo Zhang updated ARROW-11065:
-
Attachment: CMakeError.log

> [C++] Installation failed on AIX7.2
> ---
>
> Key: ARROW-11065
> URL: https://issues.apache.org/jira/browse/ARROW-11065
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 2.0.0
> Environment: AIX7.2
>Reporter: Xiaobo Zhang
>Priority: Major
> Attachments: CMakeError.log, CMakeError.log, CMakeOutput.log
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> My installation of pyarrow on AIX7.2 failed due to a missing Arrow C++ 
> library, and I was told I have to install Arrow C++ first.  I downloaded the 
> Arrow 2.0.0 tarball and tried to install its "cpp" component according to 
> the instructions.  However, I got the following error after {{cd release}} 
> when running {{cmake ..}}:
>  
> {noformat}
> Login=root: Line=602 > cmake ..
> -- Building using CMake version: 3.16.0
> -- Arrow version: 2.0.0 (full: '2.0.0')
> -- Arrow SO version: 200 (full: 200.0.0)
> -- clang-tidy not found
> -- clang-format not found
> -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN)
> -- infer not found
> -- Found cpplint executable at 
> /software/thirdparty/apache-arrow-2.0.0/cpp/build-support/cpplint.py
> -- System processor: powerpc
> -- Arrow build warning level: PRODUCTION
> CMake Error at cmake_modules/SetupCxxFlags.cmake:365 (message):
>   SSE4.2 required but compiler doesn't support it.
> Call Stack (most recent call first):
>   CMakeLists.txt:437 (include)
> -- Configuring incomplete, errors occurred!
> See also 
> "/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeOutput.log".
> See also 
> "/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeError.log".
> {noformat}
> Attached are 2 CMake output/error files.  Sutou Kouhei suggested that I 
> submit an issue here.  Can someone please help me fix the issue?  What do I 
> need to do about the required SSE4.2?
> Thanks.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10624) [R] Too much R metadata for Parquet format

2021-01-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10624:
---
Labels: pull-request-available  (was: )

> [R] Too much R metadata for Parquet format
> --
>
> Key: ARROW-10624
> URL: https://issues.apache.org/jira/browse/ARROW-10624
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 2.0.0
> Environment: R version 3.6.1 (2019-07-05)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 19.10
>Reporter: Vinicius Pinto
>Assignee: Jonathan Keane
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
> Attachments: debug-arrow.7z
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I am not able to downgrade arrow from version 2.0.0 to 1.0.1 since 
> {{arrow::install_arrow()}} always installs the latest version.
> Steps to reproduce:
>  
> {code:java}
> devtools::install_version("arrow", version = "1.0.1")
> arrow::install_arrow(){code}
>  
> If I skip the {{arrow::install_arrow()}}, I am not able to use the gzip 
> compression ({{WARNING: Arrow Gzip is not available, try using 
> arrow::install_arrow()}})
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-4698) [C++] Let StringBuilder be constructible with a pre allocated buffer for character data

2021-01-04 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-4698:

Fix Version/s: (was: 3.0.0)
   4.0.0

> [C++] Let StringBuilder be constructible with a pre allocated buffer for 
> character data
> ---
>
> Key: ARROW-4698
> URL: https://issues.apache.org/jira/browse/ARROW-4698
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Minor
> Fix For: 4.0.0
>
>
> This is useful for example when an existing buffer can be immediately reused. 
> This is currently used for [storage of strings in json 
> parsing](https://github.com/apache/arrow/blob/master/cpp/src/arrow/json/parser.cc#L60),
>  so it'd be straightforward to refactor into a constructor of StringBuilder.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7179) [C++][Compute] Array support for fill_null

2021-01-04 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-7179:

Fix Version/s: (was: 3.0.0)
   4.0.0

> [C++][Compute] Array support for fill_null
> --
>
> Key: ARROW-7179
> URL: https://issues.apache.org/jira/browse/ARROW-7179
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 4.0.0
>
>
> Add kernels which replace null values in an array with values taken from 
> the corresponding slots of another array:
> {code}
> fill_null([1, null, null, 3], [5, 6, null, 8]) -> [1, 6, null, 3]
> {code}
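
Until such a kernel exists, the proposed behavior can be emulated from 
Python; a minimal sketch, assuming a pyarrow version that ships the 
{{is_valid}} and {{if_else}} kernels:

{code}
import pyarrow as pa
import pyarrow.compute as pc

a = pa.array([1, None, None, 3])
fill = pa.array([5, 6, None, 8])

# Where `a` is valid keep its value, otherwise take the corresponding
# slot of `fill` (which may itself be null).
result = pc.if_else(pc.is_valid(a), a, fill)
# -> [1, 6, None, 3]
{code}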



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9543) [C++] Simplify parsing/conversion utilities

2021-01-04 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-9543:

Fix Version/s: (was: 3.0.0)
   4.0.0

> [C++] Simplify parsing/conversion utilities
> ---
>
> Key: ARROW-9543
> URL: https://issues.apache.org/jira/browse/ARROW-9543
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.17.1
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Various improvement ideas extracted from 
> https://github.com/apache/arrow/pull/7793 , see 
> https://github.com/apache/arrow/commit/740d8132d3220b0bcddde138bd1ab70030e227fd
> - provide a convenience function FormatValue() to complement ParseValue()
> - parameterize both parsing and formatting with the corresponding DataType 
> subclass (for example, formatting or parsing a timestamp requires a TimeUnit 
> and this can be derived from a TimestampType)
> - rename StringConverter and StringFormatter to ParseValueTraits and 
> FormatValueTraits, to emphasize the convenience functions as the preferred 
> interface
> - ParseValue should accept a string_view for simplicity, and an overload 
> should be provided which returns an optional<>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5745) [C++] properties of Map(Array|Type) are confusingly named

2021-01-04 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-5745:

Fix Version/s: (was: 3.0.0)
   4.0.0

> [C++] properties of Map(Array|Type) are confusingly named
> -
>
> Key: ARROW-5745
> URL: https://issues.apache.org/jira/browse/ARROW-5745
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 4.0.0
>
>
> In the context of ListArrays, "values" indicates the elements in a slot of 
> the ListArray. Since MapArray is a ListArray, "values" indicates the same 
> thing and the elements are key-item pairs. This naming scheme is not 
> idiomatic; these *should* be called key-value pairs but that would require 
> propagating the renaming down to ListArray.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8630) [C++][Dataset] Pass schema including all materialized fields to catch CSV edge cases

2021-01-04 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-8630:

Fix Version/s: (was: 3.0.0)
   4.0.0

> [C++][Dataset] Pass schema including all materialized fields to catch CSV 
> edge cases
> 
>
> Key: ARROW-8630
> URL: https://issues.apache.org/jira/browse/ARROW-8630
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.17.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset
> Fix For: 4.0.0
>
>
> see discussion here 
> https://github.com/apache/arrow/pull/7033#discussion_r416941674
> Fields filtered but not projected will revert to their inferred type, 
> whatever their dataset's schema may be. This can cause validated filters to 
> fail due to type disagreements



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6604) [C++] Add support for nested types to MakeArrayFromScalar

2021-01-04 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-6604:

Fix Version/s: (was: 3.0.0)
   4.0.0

> [C++] Add support for nested types to MakeArrayFromScalar
> -
>
> Key: ARROW-6604
> URL: https://issues.apache.org/jira/browse/ARROW-6604
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 4.0.0
>
>
> At the same time move MakeArrayFromScalar and MakeArrayOfNull under 
> src/arrow/array/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-4706) [C++] shared conversion framework for JSON/CSV parsers

2021-01-04 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-4706:

Fix Version/s: (was: 3.0.0)
   4.0.0

> [C++] shared conversion framework for JSON/CSV parsers
> --
>
> Key: ARROW-4706
> URL: https://issues.apache.org/jira/browse/ARROW-4706
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 4.0.0
>
>
> CSV and JSON both convert strings to values in an Array, but there is little 
> code sharing beyond {{arrow::util::StringConverter}}.
> It would be advantageous if a single interface could be shared between CSV 
> and JSON to do the heavy lifting of conversion consistently. This would 
> simplify the addition of new parsers as well as allow all parsers to 
> immediately take advantage of a new conversion strategy.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7894) [C++] DefineOptions should invoke add_definitions

2021-01-04 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-7894:

Fix Version/s: (was: 3.0.0)
   4.0.0

> [C++] DefineOptions should invoke add_definitions
> -
>
> Key: ARROW-7894
> URL: https://issues.apache.org/jira/browse/ARROW-7894
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Several build options are mirrored as preprocessor definitions, for example 
> {{ARROW_JEMALLOC}}. This could be made more consistent by requiring that 
> every option in DefineOptions should also define a preprocessor macro with 
> {{add_definitions}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5423) [C++] implement partial schema class to extend JSON conversion

2021-01-04 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-5423:

Fix Version/s: (was: 3.0.0)
   4.0.0

> [C++] implement partial schema class to extend JSON conversion
> --
>
> Key: ARROW-5423
> URL: https://issues.apache.org/jira/browse/ARROW-5423
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 4.0.0
>
>
> Currently the JSON parser supports only basic conversion rules such as 
> parsing a number to {{int64}}. In general users will want more capable 
> conversions like parsing a base64 string into binary or parsing a column of 
> objects to {{map}} instead of {{struct}}. This 
> will require extension of {{arrow::json::ParseOptions::explicit_schema}} to 
> something analogous to a schema but which supports mapping to more than a 
> simple output type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5327) [C++] allow construction of ArrayBuilders from existing arrays

2021-01-04 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-5327:

Fix Version/s: (was: 3.0.0)
   4.0.0

> [C++] allow construction of ArrayBuilders from existing arrays
> --
>
> Key: ARROW-5327
> URL: https://issues.apache.org/jira/browse/ARROW-5327
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 4.0.0
>
>
> After calling Finish it may become necessary to append further elements to an 
> array, which we don't currently support. One way to support this would be 
> consuming the array to produce a builder with the array's elements 
> pre-inserted.
> {code}
> std::shared_ptr<Array> array = get_array();
> std::unique_ptr<ArrayBuilder> builder;
> RETURN_NOT_OK(MakeBuilder(std::move(*array), &builder));
> {code}
> This will be efficient if we cannibalize the array's buffers and child data 
> when constructing the builder, which will require that the consumed array is 
> uniquely owned.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11125) [Rust] Implement logical equality for list arrays

2021-01-04 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11125:
--

 Summary: [Rust] Implement logical equality for list arrays
 Key: ARROW-11125
 URL: https://issues.apache.org/jira/browse/ARROW-11125
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Neville Dipale
Assignee: Neville Dipale


We implemented logical equality for struct arrays, but not list arrays.

This work is now required for the Parquet nested list writer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11065) [C++] Installation failed on AIX7.2

2021-01-04 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258385#comment-17258385
 ] 

Neal Richardson commented on ARROW-11065:
-

{{EXTRA_CMAKE_FLAGS}} is only picked up in a build script that R uses. Just add 
the ARROW_SIMD_LEVEL flag to your cmake invocation.
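
For example, from the build directory:

{noformat}
cmake -DARROW_SIMD_LEVEL=NONE ..
{noformat}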

> [C++] Installation failed on AIX7.2
> ---
>
> Key: ARROW-11065
> URL: https://issues.apache.org/jira/browse/ARROW-11065
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 2.0.0
> Environment: AIX7.2
>Reporter: Xiaobo Zhang
>Priority: Major
> Attachments: CMakeError.log, CMakeError.log, CMakeOutput.log
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> My installation of pyarrow on AIX7.2 failed due to a missing Arrow C++ 
> library, and I was told I have to install Arrow C++ first.  I downloaded the 
> Arrow 2.0.0 tarball and tried to install its "cpp" component according to 
> the instructions.  However, I got the following error after {{cd release}} 
> when running {{cmake ..}}:
>  
> {noformat}
> Login=root: Line=602 > cmake ..
> -- Building using CMake version: 3.16.0
> -- Arrow version: 2.0.0 (full: '2.0.0')
> -- Arrow SO version: 200 (full: 200.0.0)
> -- clang-tidy not found
> -- clang-format not found
> -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN)
> -- infer not found
> -- Found cpplint executable at 
> /software/thirdparty/apache-arrow-2.0.0/cpp/build-support/cpplint.py
> -- System processor: powerpc
> -- Arrow build warning level: PRODUCTION
> CMake Error at cmake_modules/SetupCxxFlags.cmake:365 (message):
>   SSE4.2 required but compiler doesn't support it.
> Call Stack (most recent call first):
>   CMakeLists.txt:437 (include)
> -- Configuring incomplete, errors occurred!
> See also 
> "/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeOutput.log".
> See also 
> "/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeError.log".
> {noformat}
> Attached are 2 CMake output/error files.  Sutou Kouhei suggested that I 
> submit an issue here.  Can someone please help me fix the issue?  What do I 
> need to do about the required SSE4.2?
> Thanks.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11125) [Rust] Implement logical equality for list arrays

2021-01-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11125:
---
Labels: pull-request-available  (was: )

> [Rust] Implement logical equality for list arrays
> -
>
> Key: ARROW-11125
> URL: https://issues.apache.org/jira/browse/ARROW-11125
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We implemented logical equality for struct arrays, but not list arrays.
> This work is now required for the Parquet nested list writer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11126) [Rust] Docunent and test ARROW-10656

2021-01-04 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11126:
--

 Summary: [Rust] Docunent and test ARROW-10656
 Key: ARROW-11126
 URL: https://issues.apache.org/jira/browse/ARROW-11126
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Neville Dipale


Looks like I rebased against the PR branch, but didn't push my changes before 
the PR was merged.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11126) [Rust] Document and test ARROW-10656

2021-01-04 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-11126:
---
Summary: [Rust] Document and test ARROW-10656  (was: [Rust] Docunent and 
test ARROW-10656)

> [Rust] Document and test ARROW-10656
> 
>
> Key: ARROW-11126
> URL: https://issues.apache.org/jira/browse/ARROW-11126
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Neville Dipale
>Priority: Major
>
> Looks like I rebased against the PR branch, but didn't push my changes before 
> the PR was merged.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11126) [Rust] Document and test ARROW-10656

2021-01-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11126:
---
Labels: pull-request-available  (was: )

> [Rust] Document and test ARROW-10656
> 
>
> Key: ARROW-11126
> URL: https://issues.apache.org/jira/browse/ARROW-11126
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Looks like I rebased against the PR branch, but didn't push my changes before 
> the PR was merged.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-11126) [Rust] Document and test ARROW-10656

2021-01-04 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale reassigned ARROW-11126:
--

Assignee: Neville Dipale

> [Rust] Document and test ARROW-10656
> 
>
> Key: ARROW-11126
> URL: https://issues.apache.org/jira/browse/ARROW-11126
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Looks like I rebased against the PR branch, but didn't push my changes before 
> the PR was merged.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11115) [Rust] Implement dot-product in a compute kernel

2021-01-04 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-5:
---
Summary: [Rust] Implement dot-product in a compute kernel  (was: Implement 
dot-product in a compute kernel)

> [Rust] Implement dot-product in a compute kernel
> 
>
> Key: ARROW-5
> URL: https://issues.apache.org/jira/browse/ARROW-5
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Erik De Smedt
>Priority: Minor
>  Labels: compute, dotproduct, linalg
>
> An efficient implementation of the 
> [dot-product|https://en.wikipedia.org/wiki/Dot_product] is useful for many 
> machine-learning applications.
> I would propose treating null as zero in the implementation. This behavior 
> is sensible because it corresponds to dropping any observation where null 
> data is present. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10623) [R] Version 1.0.1 breaks data.frame attributes when reading file written by 2.0.0

2021-01-04 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-10623:
---

Assignee: Jonathan Keane

> [R] Version 1.0.1 breaks data.frame attributes when reading file written by 
> 2.0.0
> -
>
> Key: ARROW-10623
> URL: https://issues.apache.org/jira/browse/ARROW-10623
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1, 2.0.0
>Reporter: Fleur Kelpin
>Assignee: Jonathan Keane
>Priority: Major
> Fix For: 2.0.1, 3.0.0
>
>
> h4. How to reproduce
>  * Create a data frame:
> {noformat}
> df <- data.frame(col1 = 1:100){noformat}
>  * Write it to a parquet file using arrow 2.0.0. The demo uses R 3.6 but the 
> same happens if you use R 4.0
>  * Read the parquet file using arrow 1.0.1. I only tried that in R 3.6
> h4. Expected
> The data frame is the same as it was before:
> {noformat}
> structure(list(col1 = 1:100), row.names = c(NA, 100L), class = 
> "data.frame"){noformat}
> h4. Actual
> The data frame has lost some information:
> {noformat}
> structure(list(1:100), class = "data.frame"){noformat}
> h4. Demo
> I'm not sure what the easiest way is to put up a demo project for this, since 
> you need to switch between arrow installations. But I've created this 
> docker-based demo:
> [https://github.com/fdlk/arrow2/]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls

2021-01-04 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-11067:
---

Assignee: Antoine Pitrou

> [R] read_csv_arrow silently fails to read some strings and returns nulls
> 
>
> Key: ARROW-11067
> URL: https://issues.apache.org/jira/browse/ARROW-11067
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: John Sheffield
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: arrow_explanation.png, arrow_failure_cases.csv, 
> arrow_failure_cases.csv, arrowbug1.png, arrowbug1.png, demo_data.csv
>
>
> A sample file is attached, showing 10 rows each of strings with consistent 
> failures (false_na = TRUE) and consistent successes (false_na = FALSE). The 
> strings are in the column `json_string` – if relevant, they are geojsons with 
> min nchar of 33,229 and max nchar of 202,515.
> When I read this sample file with other R CSV readers (readr and data.table 
> shown), the files are imported correctly and there are no NAs in the 
> json_string column.
> When I read with arrow::read_csv_arrow, 50% of the values in the sample 
> json_string column end up as NAs. Setting as_data_frame to TRUE or FALSE 
> does not change the behavior, so this might not be limited to the R 
> interface, but I can't help debug much further upstream.
>  
>  
> {code:java}
> aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
> aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
> bbb <- data.table::fread("demo_data.csv")
> ccc <- readr::read_csv("demo_data.csv")
> mean(is.na(aaa1$json_string)) # 0.5
> mean(is.na(aaa2$column(1))) # Scalar 0.5
> mean(is.na(bbb$json_string)) # 0
> mean(is.na(ccc$json_string)) # 0{code}
>  
>  
>  * arrow 2.0 (latest CRAN)
>  * readr 1.4.0
>  * data.table 1.13.2
>  * R version 4.0.1 (2020-06-06)
>  * MacOS Catalina 10.15.7 / x86_64-apple-darwin17.0
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10695) [C++][Dataset] Allow to use a UUID in the basename_template when writing a dataset

2021-01-04 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-10695:
-
Fix Version/s: (was: 3.0.0)
   4.0.0

> [C++][Dataset] Allow to use a UUID in the basename_template when writing a 
> dataset
> --
>
> Key: ARROW-10695
> URL: https://issues.apache.org/jira/browse/ARROW-10695
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: dataset, dataset-parquet-write
> Fix For: 4.0.0
>
>
> Currently we allow the user to specify a {{basename_template}}, and this can 
> include a {{"\{i\}"}} part that is replaced with an automatically incremented 
> integer (so each generated file written to a single partition is unique):
> https://github.com/apache/arrow/blob/master/python/pyarrow/dataset.py#L713-L717
> It _might_ be useful to also have the ability to use a UUID, to ensure the 
> file is unique in general (not only for a single write) and to mimic the 
> behaviour of the old {{write_to_dataset}} implementation.
> For example, we could look for a {{"\{uuid\}"}} in the template string, and 
> if present replace it for each file with a new UUID.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8655) [C++][Dataset][Python][R] Preserve partitioning information for a discovered Dataset

2021-01-04 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-8655:

Fix Version/s: (was: 3.0.0)
   4.0.0

> [C++][Dataset][Python][R] Preserve partitioning information for a discovered 
> Dataset
> 
>
> Key: ARROW-8655
> URL: https://issues.apache.org/jira/browse/ARROW-8655
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: dataset, dataset-dask-integration, pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Currently, we have the {{HivePartitioning}} and {{DirectoryPartitioning}} 
> classes that describe a partitioning used in the discovery phase. But once a 
> dataset object is created, it no longer knows anything about this; it just 
> has partition expressions for the fragments. And the partition keys are 
> added to 
> the schema, but you can't directly know which columns of the schema 
> originated from the partitions.
> However, there can be use cases where it would be useful that a dataset still 
> "knows" from what kind of partitioning it was created:
> - The "read CSV write back Parquet" use case, where the CSV was already 
> partitioned and you want to automatically preserve that partitioning for 
> parquet (kind of roundtripping the partitioning on read/write)
> - To convert the dataset to other representation, eg conversion to pandas, it 
> can be useful to know what columns were partition columns (eg for pandas, 
> those columns might be good candidates to be set as the index of the 
> pandas/dask DataFrame). I can imagine conversions to other systems can use 
> similar information.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10882) [Python][Dataset] Writing dataset from python iterator of record batches

2021-01-04 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-10882:
-
Fix Version/s: (was: 3.0.0)
   4.0.0

> [Python][Dataset] Writing dataset from python iterator of record batches
> 
>
> Key: ARROW-10882
> URL: https://issues.apache.org/jira/browse/ARROW-10882
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: dataset
> Fix For: 4.0.0
>
>
> At the moment, from python you can write a dataset with {{ds.write_dataset}} 
> for example starting from a *list* of record batches. 
> But this currently needs to be an actual list (or gets converted to a list), 
> so an iterator or generator gets fully consumed (potentially bringing the 
> record batches into memory) before starting to write. 
> We should also be able to use the python iterator itself to back a 
> {{RecordBatchIterator}}-like object that can be consumed while writing the 
> batches.
> We already have a {{arrow::py::PyRecordBatchReader}} that might be useful 
> here.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8282) [C++/Python][Dataset] Support schema evolution for integer columns

2021-01-04 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-8282:

Fix Version/s: (was: 3.0.0)
   4.0.0

> [C++/Python][Dataset] Support schema evolution for integer columns
> --
>
> Key: ARROW-8282
> URL: https://issues.apache.org/jira/browse/ARROW-8282
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Uwe Korn
>Priority: Major
>  Labels: dataset
> Fix For: 4.0.0
>
>
> When reading in a dataset where the schema specifies that column X is of type 
> {{int64}} but the partition actually contains the data stored in that column 
> as {{int32}}, an upcast should be done.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9749) [C++][Dataset] Extract format-specific scan options from FileFormat

2021-01-04 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-9749:

Fix Version/s: (was: 3.0.0)
   4.0.0

> [C++][Dataset] Extract format-specific scan options from FileFormat
> ---
>
> Key: ARROW-9749
> URL: https://issues.apache.org/jira/browse/ARROW-9749
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 1.0.0
>Reporter: Ben Kietzman
>Priority: Major
>  Labels: dataset
> Fix For: 4.0.0
>
>
> Currently, format-specific scan options are embedded as members of the 
> corresponding subclass of FileFormat. Extracting these to an options struct 
> would provide better separation of concerns; currently the only way to scan a 
> parquet formatted dataset with different options is to reconstruct it in a 
> differently optioned format from its component files.
> CsvFileFormat could retain ParseOptions as a member, since (for example) 
> tab-separated vs comma-separated values can justifiably be considered 
> different formats.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9948) [C++] Decimal128 does not check scale range when rescaling; can cause buffer overflow

2021-01-04 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-9948:

Fix Version/s: 3.0.0

> [C++] Decimal128 does not check scale range when rescaling; can cause buffer 
> overflow
> -
>
> Key: ARROW-9948
> URL: https://issues.apache.org/jira/browse/ARROW-9948
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Mingyu Zhong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> BasicDecimal128::GetScaleMultiplier has a DCHECK on the scale, but the scale 
> can come from users. For example, Decimal128::FromString("1e100") will cause 
> an out-of-bounds read.
> BasicDecimal128::Rescale and BasicDecimal128::GetWholeAndFraction have the 
> same problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9948) [C++] Decimal128 does not check scale range when rescaling; can cause buffer overflow

2021-01-04 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman reassigned ARROW-9948:
---

Assignee: Andrew Wieteska

> [C++] Decimal128 does not check scale range when rescaling; can cause buffer 
> overflow
> -
>
> Key: ARROW-9948
> URL: https://issues.apache.org/jira/browse/ARROW-9948
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Mingyu Zhong
>Assignee: Andrew Wieteska
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> BasicDecimal128::GetScaleMultiplier has a DCHECK on the scale, but the scale 
> can come from users. For example, Decimal128::FromString("1e100") will cause 
> an out-of-bounds read.
> BasicDecimal128::Rescale and BasicDecimal128::GetWholeAndFraction have the 
> same problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10966) [C++] Use FnOnce for ThreadPool's tasks instead of std::function

2021-01-04 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman reassigned ARROW-10966:


Assignee: (was: Ben Kietzman)

> [C++] Use FnOnce for ThreadPool's tasks instead of std::function
> 
>
> Key: ARROW-10966
> URL: https://issues.apache.org/jira/browse/ARROW-10966
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Ben Kietzman
>Priority: Major
> Fix For: 4.0.0
>
>
> FnOnce drops dependencies on invocation and is lighter weight than 
> std::function



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10624) [R] Proactively remove "problems" attributes

2021-01-04 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-10624:
---
Summary: [R] Proactively remove "problems" attributes  (was: [R] Too much R 
metadata for Parquet format)

> [R] Proactively remove "problems" attributes
> 
>
> Key: ARROW-10624
> URL: https://issues.apache.org/jira/browse/ARROW-10624
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 2.0.0
> Environment: R version 3.6.1 (2019-07-05)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 19.10
>Reporter: Vinicius Pinto
>Assignee: Jonathan Keane
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
> Attachments: debug-arrow.7z
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> I am not able to downgrade arrow from version 2.0.0 to 1.0.1 since 
> {{arrow::install_arrow()}} always installs the latest version.
> Steps to reproduce:
>  
> {code:java}
> devtools::install_version("arrow", version = "1.0.1")
> arrow::install_arrow(){code}
>  
> If I skip the {{arrow::install_arrow()}}, I am not able to use the gzip 
> compression ({{WARNING: Arrow Gzip is not available, try using 
> arrow::install_arrow()}})
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11037) [Rust] Improve performance of string fromIter

2021-01-04 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb resolved ARROW-11037.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 9016
[https://github.com/apache/arrow/pull/9016]

> [Rust] Improve performance of string fromIter
> -
>
> Key: ARROW-11037
> URL: https://issues.apache.org/jira/browse/ARROW-11037
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Avoid copying from Vec to Buffer, writing directly to a buffer instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11065) [C++] Installation failed on AIX7.2

2021-01-04 Thread Xiaobo Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaobo Zhang updated ARROW-11065:
-
Attachment: CMakeError.log

> [C++] Installation failed on AIX7.2
> ---
>
> Key: ARROW-11065
> URL: https://issues.apache.org/jira/browse/ARROW-11065
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 2.0.0
> Environment: AIX7.2
>Reporter: Xiaobo Zhang
>Priority: Major
> Attachments: CMakeError.log, CMakeError.log, CMakeError.log, 
> CMakeOutput.log, cmake.log
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> My installation of pyarrow on AIX7.2 failed due to a missing Arrow C++ 
> library, and I was told I have to install Arrow C++ first.  I downloaded the 
> Arrow 2.0.0 tarball and tried to install its "cpp" component according to 
> the instructions.  However, I got the following error after {{cd release}} 
> when running {{cmake ..}}:
>  
> {noformat}
> Login=root: Line=602 > cmake ..
> -- Building using CMake version: 3.16.0
> -- Arrow version: 2.0.0 (full: '2.0.0')
> -- Arrow SO version: 200 (full: 200.0.0)
> -- clang-tidy not found
> -- clang-format not found
> -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN)
> -- infer not found
> -- Found cpplint executable at 
> /software/thirdparty/apache-arrow-2.0.0/cpp/build-support/cpplint.py
> -- System processor: powerpc
> -- Arrow build warning level: PRODUCTION
> CMake Error at cmake_modules/SetupCxxFlags.cmake:365 (message):
>   SSE4.2 required but compiler doesn't support it.
> Call Stack (most recent call first):
>   CMakeLists.txt:437 (include)
> -- Configuring incomplete, errors occurred!
> See also 
> "/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeOutput.log".
> See also 
> "/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeError.log".
> {noformat}
> Attached are 2 CMake output/error files.  Sutou Kouhei suggested that I 
> submit an issue here.  Can someone please help me fix the issue?  What do I 
> need to do about the required SSE4.2?
> Thanks.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11065) [C++] Installation failed on AIX7.2

2021-01-04 Thread Xiaobo Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaobo Zhang updated ARROW-11065:
-
Attachment: cmake.log

> [C++] Installation failed on AIX7.2
> ---
>
> Key: ARROW-11065
> URL: https://issues.apache.org/jira/browse/ARROW-11065
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 2.0.0
> Environment: AIX7.2
>Reporter: Xiaobo Zhang
>Priority: Major
> Attachments: CMakeError.log, CMakeError.log, CMakeError.log, 
> CMakeOutput.log, cmake.log
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> My installation of pyarrow on AIX7.2 failed due to a missing Arrow C++ 
> library, and I was told I have to install Arrow C++ first.  I downloaded the 
> Arrow 2.0.0 tarball and tried to install its "cpp" component according to 
> the instructions.  However, I got the following error after {{cd release}} 
> when running {{cmake ..}}:
>  
> {noformat}
> Login=root: Line=602 > cmake ..
> -- Building using CMake version: 3.16.0
> -- Arrow version: 2.0.0 (full: '2.0.0')
> -- Arrow SO version: 200 (full: 200.0.0)
> -- clang-tidy not found
> -- clang-format not found
> -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN)
> -- infer not found
> -- Found cpplint executable at 
> /software/thirdparty/apache-arrow-2.0.0/cpp/build-support/cpplint.py
> -- System processor: powerpc
> -- Arrow build warning level: PRODUCTION
> CMake Error at cmake_modules/SetupCxxFlags.cmake:365 (message):
>   SSE4.2 required but compiler doesn't support it.
> Call Stack (most recent call first):
>   CMakeLists.txt:437 (include)
> -- Configuring incomplete, errors occurred!
> See also 
> "/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeOutput.log".
> See also 
> "/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeError.log".
> {noformat}
> Attached are 2 CMake output/error files.  Sutou Kouhei suggested that I submit 
> an issue here.  Can someone please help me fix the issue?  What do I need to 
> do about the required SSE4.2?
> Thanks.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11065) [C++] Installation failed on AIX7.2

2021-01-04 Thread Xiaobo Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258457#comment-17258457
 ] 

Xiaobo Zhang commented on ARROW-11065:
--

I am a little puzzled now.  I issued "cmake -DARROW_SIMD_LEVEL=NONE .. 
>cmake.log 2>&1 &".  In the cmake.log file, ARROW_SIMD_LEVEL=NONE appears at 
line 112.  However, I can still see the SSE error at line 8 in CMakeError.log.  
Besides, there are additional unrecognized options/symbols, as shown below.

c++: error: unrecognized command line option '-march=haswell'
c++: error: unrecognized command line option '-mavx2'; did you mean '-maix32'?

c++: error: unrecognized command line option '-march=skylake-avx512'
c++: error: unrecognized command line option '-mbmi2'
c++: error: unrecognized command line option '-mavx512f'; did you mean 
'-maix32'?
c++: error: unrecognized command line option '-mavx512cd'
c++: error: unrecognized command line option '-mavx512vl'
c++: error: unrecognized command line option '-mavx512dq'
c++: error: unrecognized command line option '-mavx512bw'

ld: 0711-317 ERROR: Undefined symbol: .pthread_create
ld: 0711-317 ERROR: Undefined symbol: .pthread_detach
ld: 0711-317 ERROR: Undefined symbol: .pthread_join
ld: 0711-317 ERROR: Undefined symbol: .pthread_atfork
ld: 0711-317 ERROR: Undefined symbol: .pthread_exit 

 

At line 102, there is a fatal error about a missing execinfo.h file.  The bad news 
is that the CMakeTmp subdirectory is empty at the end, so I can't check 
CheckSymbolExists.c.

/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeTmp/CheckSymbolExists.c:2:10:
 fatal error: execinfo.h: No such file or directory
 #include <execinfo.h>
  ^~~~
compilation terminated.

 

Looks like there is a lot of work to be done in order to install Arrow 
C++ on AIX.

Thanks.

[^cmake.log][^CMakeError.log]

> [C++] Installation failed on AIX7.2
> ---
>
> Key: ARROW-11065
> URL: https://issues.apache.org/jira/browse/ARROW-11065
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 2.0.0
> Environment: AIX7.2
>Reporter: Xiaobo Zhang
>Priority: Major
> Attachments: CMakeError.log, CMakeError.log, CMakeError.log, 
> CMakeOutput.log, cmake.log
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> My installation of pyarrow on AIX7.2 failed due to missing ARROW and I was 
> told I have to install ARROW C++ first.  I downloaded ARROW 2.0.0 
> {color:#24292e}tar ball and tried to install its "cpp" component according to 
> the instruction.  However, I got the following error after {{cd release}} to 
> run {{cmake ..}}: {color}
>  
> {noformat}
> Login=root: Line=602 > cmake ..
> -- Building using CMake version: 3.16.0
> -- Arrow version: 2.0.0 (full: '2.0.0')
> -- Arrow SO version: 200 (full: 200.0.0)
> -- clang-tidy not found
> -- clang-format not found
> -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN)
> -- infer not found
> -- Found cpplint executable at 
> /software/thirdparty/apache-arrow-2.0.0/cpp/build-support/cpplint.py
> -- System processor: powerpc
> -- Arrow build warning level: PRODUCTION
> CMake Error at cmake_modules/SetupCxxFlags.cmake:365 (message):
>   SSE4.2 required but compiler doesn't support it.
> Call Stack (most recent call first):
>   CMakeLists.txt:437 (include)
> -- Configuring incomplete, errors occurred!
> See also 
> "/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeOutput.log".
> See also 
> "/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeError.log".
> {noformat}
> Attached are 2 CMake output/error files.  Sutou Kouhei suggested me to submit 
> an issue here.  Can someone please help me to fix the issue?  What do I have 
> to do with required SSE4.2?
> Thanks.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10966) [C++] Use FnOnce for ThreadPool's tasks instead of std::function

2021-01-04 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman reassigned ARROW-10966:


Assignee: Michal Nowakiewicz

> [C++] Use FnOnce for ThreadPool's tasks instead of std::function
> 
>
> Key: ARROW-10966
> URL: https://issues.apache.org/jira/browse/ARROW-10966
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Michal Nowakiewicz
>Priority: Major
> Fix For: 4.0.0
>
>
> FnOnce drops its dependencies on invocation and is lighter weight than 
> std::function.
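The change itself is C++ (a move-only call-once wrapper replacing copyable std::function). As a loose Python analogy of the semantics only, here is a hypothetical call-once task whose captured dependencies are released at invocation time rather than kept alive for the lifetime of the wrapper:

{code:python}
import weakref

class CallOnce:
    """Hypothetical sketch: invoking the task drops the captured callable
    and arguments, so their lifetimes end at invocation."""

    def __init__(self, fn, *args):
        self._pending = (fn, args)

    def __call__(self):
        if self._pending is None:
            raise RuntimeError("task already invoked")
        fn, args = self._pending
        self._pending = None          # drop dependencies before running
        return fn(*args)

class Payload:
    pass

payload = Payload()
probe = weakref.ref(payload)
task = CallOnce(lambda p: type(p).__name__, payload)
del payload                 # only the task keeps the payload alive now

print(task())               # -> 'Payload'
print(probe() is None)      # -> True: the dependency was dropped
{code}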



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10881) [C++] EXC_BAD_ACCESS in BaseSetBitRunReader::NextRun

2021-01-04 Thread Uwe Korn (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258472#comment-17258472
 ] 

Uwe Korn commented on ARROW-10881:
--

With more debug information added, we get the following:

{code}
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
(code=1, address=0x0)
frame #0: 0x0001009a5340 
libparquet.300.dylib`arrow::internal::BaseSetBitRunReader::LoadFullWord(this=0x00016fdfce30)
 at bit_run_reader.h:264:5
   261    if (Reverse) {
   262      bitmap_ -= 8;
   263    }
-> 264    memcpy(&word, bitmap_, 8);
   265    if (!Reverse) {
   266      bitmap_ += 8;
   267    }
Target 0: (parquet-encoding-benchmark) stopped.
(lldb) ll
error: 'll' is not a valid command.
(lldb) p word
(uint64_t) $0 = 6171905360
(lldb) p bitmap_
(const uint8_t *) $1 = 0x
{code}

> [C++] EXC_BAD_ACCESS in BaseSetBitRunReader::NextRun
> ---
>
> Key: ARROW-10881
> URL: https://issues.apache.org/jira/browse/ARROW-10881
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Affects Versions: 2.0.0
>Reporter: Uwe Korn
>Priority: Major
>  Labels: osx-arm64
>
> {{./release/parquet-encoding-benchmark}} fails with
> {code}
> BM_PlainDecodingFloat/65536  4206 ns 4206 
> ns   167354 bytes_per_second=58.0474G/s
> error: libparquet.300.dylib debug map object file 
> '/Users/uwe/Development/arrow/cpp/build/src/parquet/CMakeFiles/parquet_objlib.dir/encoding.cc.o'
>  has changed (actual time is 2020-12-10 22:57:29.0, debug map time is 
> 2020-12-10 21:02:52.0) since this executable was linked, file will be 
> ignored
> Process 11120 stopped
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
> (code=1, address=0x0)
> frame #0: 0x00010047fe04 
> libparquet.300.dylib`arrow::internal::BaseSetBitRunReader::NextRun() + 
> 192
> libparquet.300.dylib`arrow::internal::BaseSetBitRunReader::NextRun:
> ->  0x10047fe04 <+192>: ldur   x11, [x9, #-0x8]
> 0x10047fe08 <+196>: str    x9, [x19]
> 0x10047fe0c <+200>: str    x11, [x19, #0x18]
> 0x10047fe10 <+204>: rbit   x10, x11
> Target 0: (parquet-encoding-benchmark) stopped.
> (lldb) bt
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
> (code=1, address=0x0)
>   * frame #0: 0x00010047fe04 
> libparquet.300.dylib`arrow::internal::BaseSetBitRunReader::NextRun() + 
> 192
> frame #1: 0x00010047f808 libparquet.300.dylib`parquet::(anonymous 
> namespace)::PlainEncoder 
> >::PutSpaced(bool const*, int, unsigned char const*, long long) + 336
> frame #2: 0x00018970 
> parquet-encoding-benchmark`parquet::BM_PlainEncodingSpacedBoolean(benchmark::State&)
>  at encoding_benchmark.cc:249:14 [opt]
> frame #3: 0x0001881c 
> parquet-encoding-benchmark`parquet::BM_PlainEncodingSpacedBoolean(state=0x00016fdfd4b8)
>  at encoding_benchmark.cc:257 [opt]
> frame #4: 0x0001001614f4 
> libbenchmark.0.dylib`benchmark::internal::BenchmarkInstance::Run(unsigned 
> long long, int, benchmark::internal::ThreadTimer*, 
> benchmark::internal::ThreadManager*) const + 68
> frame #5: 0x000100173ae8 
> libbenchmark.0.dylib`benchmark::internal::(anonymous 
> namespace)::RunInThread(benchmark::internal::BenchmarkInstance const*, 
> unsigned long long, int, benchmark::internal::ThreadManager*) + 80
> frame #6: 0x0001001723c8 
> libbenchmark.0.dylib`benchmark::internal::RunBenchmark(benchmark::internal::BenchmarkInstance
>  const&, std::__1::vector std::__1::allocator >*) + 1284
> frame #7: 0x00010015ee7c 
> libbenchmark.0.dylib`benchmark::RunSpecifiedBenchmarks(benchmark::BenchmarkReporter*,
>  benchmark::BenchmarkReporter*) + 1824
> frame #8: 0x00010014beec libbenchmark_main.0.dylib`main + 76
> frame #9: 0x00019e270f54 libdyld.dylib`start + 4
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10881) [C++] EXC_BAD_ACCESS in BaseSetBitRunReader::NextRun

2021-01-04 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258474#comment-17258474
 ] 

Antoine Pitrou commented on ARROW-10881:


Can you also post the full traceback? That should give accurate line numbers 
for all frames.

> [C++] EXC_BAD_ACCESS in BaseSetBitRunReader::NextRun
> ---
>
> Key: ARROW-10881
> URL: https://issues.apache.org/jira/browse/ARROW-10881
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Affects Versions: 2.0.0
>Reporter: Uwe Korn
>Priority: Major
>  Labels: osx-arm64
>
> {{./release/parquet-encoding-benchmark}} fails with
> {code}
> BM_PlainDecodingFloat/65536  4206 ns 4206 
> ns   167354 bytes_per_second=58.0474G/s
> error: libparquet.300.dylib debug map object file 
> '/Users/uwe/Development/arrow/cpp/build/src/parquet/CMakeFiles/parquet_objlib.dir/encoding.cc.o'
>  has changed (actual time is 2020-12-10 22:57:29.0, debug map time is 
> 2020-12-10 21:02:52.0) since this executable was linked, file will be 
> ignored
> Process 11120 stopped
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
> (code=1, address=0x0)
> frame #0: 0x00010047fe04 
> libparquet.300.dylib`arrow::internal::BaseSetBitRunReader::NextRun() + 
> 192
> libparquet.300.dylib`arrow::internal::BaseSetBitRunReader::NextRun:
> ->  0x10047fe04 <+192>: ldur   x11, [x9, #-0x8]
> 0x10047fe08 <+196>: str    x9, [x19]
> 0x10047fe0c <+200>: str    x11, [x19, #0x18]
> 0x10047fe10 <+204>: rbit   x10, x11
> Target 0: (parquet-encoding-benchmark) stopped.
> (lldb) bt
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
> (code=1, address=0x0)
>   * frame #0: 0x00010047fe04 
> libparquet.300.dylib`arrow::internal::BaseSetBitRunReader::NextRun() + 
> 192
> frame #1: 0x00010047f808 libparquet.300.dylib`parquet::(anonymous 
> namespace)::PlainEncoder 
> >::PutSpaced(bool const*, int, unsigned char const*, long long) + 336
> frame #2: 0x00018970 
> parquet-encoding-benchmark`parquet::BM_PlainEncodingSpacedBoolean(benchmark::State&)
>  at encoding_benchmark.cc:249:14 [opt]
> frame #3: 0x0001881c 
> parquet-encoding-benchmark`parquet::BM_PlainEncodingSpacedBoolean(state=0x00016fdfd4b8)
>  at encoding_benchmark.cc:257 [opt]
> frame #4: 0x0001001614f4 
> libbenchmark.0.dylib`benchmark::internal::BenchmarkInstance::Run(unsigned 
> long long, int, benchmark::internal::ThreadTimer*, 
> benchmark::internal::ThreadManager*) const + 68
> frame #5: 0x000100173ae8 
> libbenchmark.0.dylib`benchmark::internal::(anonymous 
> namespace)::RunInThread(benchmark::internal::BenchmarkInstance const*, 
> unsigned long long, int, benchmark::internal::ThreadManager*) + 80
> frame #6: 0x0001001723c8 
> libbenchmark.0.dylib`benchmark::internal::RunBenchmark(benchmark::internal::BenchmarkInstance
>  const&, std::__1::vector std::__1::allocator >*) + 1284
> frame #7: 0x00010015ee7c 
> libbenchmark.0.dylib`benchmark::RunSpecifiedBenchmarks(benchmark::BenchmarkReporter*,
>  benchmark::BenchmarkReporter*) + 1824
> frame #8: 0x00010014beec libbenchmark_main.0.dylib`main + 76
> frame #9: 0x00019e270f54 libdyld.dylib`start + 4
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9612) [Python] Automatically fall back on larger IO block size when JSON parsing fails

2021-01-04 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pere-Lluís Huguet Cabot updated ARROW-9612:
---
Attachment: wiki_04.jsonl

> [Python] Automatically fall back on larger IO block size when JSON parsing fails
> ---
>
> Key: ARROW-9612
> URL: https://issues.apache.org/jira/browse/ARROW-9612
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: wiki_04.jsonl
>
>
> From GitHub issue
> https://github.com/apache/arrow/issues/7835
> This seems like a less-than-ideal failure mode; perhaps when this occurs it 
> could automatically fall back to processing the file as a single block?
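A user-level sketch of that fallback with the public pyarrow.json API (the automatic version would live inside the reader; the file name here is arbitrary):

{code:python}
import os

import pyarrow as pa
import pyarrow.json as pj

path = "data.jsonl"  # any newline-delimited JSON file

try:
    table = pj.read_json(path)
except pa.ArrowInvalid:
    # A JSON object straddled a block boundary; retry with a single
    # block spanning the entire file.
    opts = pj.ReadOptions(block_size=os.path.getsize(path))
    table = pj.read_json(path, read_options=opts)
{code}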



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9612) [Python] Automatically fall back on larger IO block size when JSON parsing fails

2021-01-04 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pere-Lluís Huguet Cabot updated ARROW-9612:
---
Attachment: (was: wiki_04.jsonl)

> [Python] Automatically fall back on larger IO block size when JSON parsing fails
> ---
>
> Key: ARROW-9612
> URL: https://issues.apache.org/jira/browse/ARROW-9612
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 3.0.0
>
>
> From GitHub issue
> https://github.com/apache/arrow/issues/7835
> This seems like a less-than-ideal failure mode; perhaps when this occurs it 
> could automatically fall back to processing the file as a single block?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10881) [C++] EXC_BAD_ACCESS in BaseSetBitRunReader::NextRun

2021-01-04 Thread Uwe Korn (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258481#comment-17258481
 ] 

Uwe Korn commented on ARROW-10881:
--

I'll push a PR later that should fix this ;)

> [C++] EXC_BAD_ACCESS in BaseSetBitRunReader::NextRun
> ---
>
> Key: ARROW-10881
> URL: https://issues.apache.org/jira/browse/ARROW-10881
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Affects Versions: 2.0.0
>Reporter: Uwe Korn
>Priority: Major
>  Labels: osx-arm64
>
> {{./release/parquet-encoding-benchmark}} fails with
> {code}
> BM_PlainDecodingFloat/65536  4206 ns 4206 
> ns   167354 bytes_per_second=58.0474G/s
> error: libparquet.300.dylib debug map object file 
> '/Users/uwe/Development/arrow/cpp/build/src/parquet/CMakeFiles/parquet_objlib.dir/encoding.cc.o'
>  has changed (actual time is 2020-12-10 22:57:29.0, debug map time is 
> 2020-12-10 21:02:52.0) since this executable was linked, file will be 
> ignored
> Process 11120 stopped
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
> (code=1, address=0x0)
> frame #0: 0x00010047fe04 
> libparquet.300.dylib`arrow::internal::BaseSetBitRunReader::NextRun() + 
> 192
> libparquet.300.dylib`arrow::internal::BaseSetBitRunReader::NextRun:
> ->  0x10047fe04 <+192>: ldur   x11, [x9, #-0x8]
> 0x10047fe08 <+196>: str    x9, [x19]
> 0x10047fe0c <+200>: str    x11, [x19, #0x18]
> 0x10047fe10 <+204>: rbit   x10, x11
> Target 0: (parquet-encoding-benchmark) stopped.
> (lldb) bt
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
> (code=1, address=0x0)
>   * frame #0: 0x00010047fe04 
> libparquet.300.dylib`arrow::internal::BaseSetBitRunReader::NextRun() + 
> 192
> frame #1: 0x00010047f808 libparquet.300.dylib`parquet::(anonymous 
> namespace)::PlainEncoder 
> >::PutSpaced(bool const*, int, unsigned char const*, long long) + 336
> frame #2: 0x00018970 
> parquet-encoding-benchmark`parquet::BM_PlainEncodingSpacedBoolean(benchmark::State&)
>  at encoding_benchmark.cc:249:14 [opt]
> frame #3: 0x0001881c 
> parquet-encoding-benchmark`parquet::BM_PlainEncodingSpacedBoolean(state=0x00016fdfd4b8)
>  at encoding_benchmark.cc:257 [opt]
> frame #4: 0x0001001614f4 
> libbenchmark.0.dylib`benchmark::internal::BenchmarkInstance::Run(unsigned 
> long long, int, benchmark::internal::ThreadTimer*, 
> benchmark::internal::ThreadManager*) const + 68
> frame #5: 0x000100173ae8 
> libbenchmark.0.dylib`benchmark::internal::(anonymous 
> namespace)::RunInThread(benchmark::internal::BenchmarkInstance const*, 
> unsigned long long, int, benchmark::internal::ThreadManager*) + 80
> frame #6: 0x0001001723c8 
> libbenchmark.0.dylib`benchmark::internal::RunBenchmark(benchmark::internal::BenchmarkInstance
>  const&, std::__1::vector std::__1::allocator >*) + 1284
> frame #7: 0x00010015ee7c 
> libbenchmark.0.dylib`benchmark::RunSpecifiedBenchmarks(benchmark::BenchmarkReporter*,
>  benchmark::BenchmarkReporter*) + 1824
> frame #8: 0x00010014beec libbenchmark_main.0.dylib`main + 76
> frame #9: 0x00019e270f54 libdyld.dylib`start + 4
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6582) [R] Arrow to R fails with embedded nuls in strings

2021-01-04 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-6582.

Resolution: Fixed

Issue resolved by pull request 8365
[https://github.com/apache/arrow/pull/8365]

> [R] Arrow to R fails with embedded nuls in strings
> --
>
> Key: ARROW-6582
> URL: https://issues.apache.org/jira/browse/ARROW-6582
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.14.1
> Environment: Windows 10
> R 3.4.4
>Reporter: John Cassil
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Apologies if this issue isn't categorized or documented appropriately.  
> Please be gentle! :)
> As a heavy R user who normally interacts with parquet files using SparklyR, 
> I recently decided to try arrow::read_parquet() on a few parquet 
> files that were on my local machine rather than in Hadoop.  I was not able to 
> proceed after several attempts due to embedded nuls.  For example:
> try({df <- read_parquet('out_2019-09_data_1.snappy.parquet') })
> Error in Table__to_dataframe(x, use_threads = option_use_threads()) : 
>   embedded nul in string: 'INSTALL BOTH LEFT FRONT AND RIGHT FRONT  TORQUE 
> ARMS\0 ARMS'
> Is there a solution to this?
> I have also hit roadblocks with embedded nuls in the past with CSVs using 
> data.table::fread(), but readr::read_delim() seems to handle them gracefully, 
> proceeding with just a warning.
> Apologies that I do not have a handy reprex. I don't know if I can even 
> recreate a parquet file with embedded nuls using arrow if it won't let me 
> read one in, and I can't share this file due to company restrictions.
> Please let me know how I can be of any more help!
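For what it's worth, such a file can be produced from Python, which may help with a reprex; a minimal sketch using only pyarrow (file name arbitrary):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# "\x00" is the embedded nul that trips up the Arrow-to-R conversion.
table = pa.table({
    "description": ["INSTALL BOTH LEFT FRONT AND RIGHT FRONT TORQUE ARMS\x00 ARMS"],
})
pq.write_table(table, "embedded_nul.parquet")

# Reading back with pyarrow works; the failure is specific to converting
# the nul-containing string column to an R character vector.
print(pq.read_table("embedded_nul.parquet").column("description")[0])
{code}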



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11072) [Rust] [Parquet] Support int32 and int64 physical types

2021-01-04 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-11072:

Summary: [Rust] [Parquet] Support int32 and int64 physical types  (was: 
Support int32 and int64 physical types)

> [Rust] [Parquet] Support int32 and int64 physical types
> ---
>
> Key: ARROW-11072
> URL: https://issues.apache.org/jira/browse/ARROW-11072
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Florian Müller
>Assignee: Florian Müller
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11072) [Rust] [Parquet] Support int32 and int64 physical types

2021-01-04 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb resolved ARROW-11072.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 9047
[https://github.com/apache/arrow/pull/9047]

> [Rust] [Parquet] Support int32 and int64 physical types
> ---
>
> Key: ARROW-11072
> URL: https://issues.apache.org/jira/browse/ARROW-11072
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Florian Müller
>Assignee: Florian Müller
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10881) [C++] EXC_BAD_ACCESS in BaseSetBitRunReader::NextRun

2021-01-04 Thread Uwe Korn (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Korn reassigned ARROW-10881:


Assignee: Uwe Korn

> [C++] EXC_BAD_ACCESS in BaseSetBitRunReader::NextRun
> ---
>
> Key: ARROW-10881
> URL: https://issues.apache.org/jira/browse/ARROW-10881
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Affects Versions: 2.0.0
>Reporter: Uwe Korn
>Assignee: Uwe Korn
>Priority: Major
>  Labels: osx-arm64
>
> {{./release/parquet-encoding-benchmark}} fails with
> {code}
> BM_PlainDecodingFloat/65536  4206 ns 4206 
> ns   167354 bytes_per_second=58.0474G/s
> error: libparquet.300.dylib debug map object file 
> '/Users/uwe/Development/arrow/cpp/build/src/parquet/CMakeFiles/parquet_objlib.dir/encoding.cc.o'
>  has changed (actual time is 2020-12-10 22:57:29.0, debug map time is 
> 2020-12-10 21:02:52.0) since this executable was linked, file will be 
> ignored
> Process 11120 stopped
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
> (code=1, address=0x0)
> frame #0: 0x00010047fe04 
> libparquet.300.dylib`arrow::internal::BaseSetBitRunReader::NextRun() + 
> 192
> libparquet.300.dylib`arrow::internal::BaseSetBitRunReader::NextRun:
> ->  0x10047fe04 <+192>: ldur   x11, [x9, #-0x8]
> 0x10047fe08 <+196>: str    x9, [x19]
> 0x10047fe0c <+200>: str    x11, [x19, #0x18]
> 0x10047fe10 <+204>: rbit   x10, x11
> Target 0: (parquet-encoding-benchmark) stopped.
> (lldb) bt
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
> (code=1, address=0x0)
>   * frame #0: 0x00010047fe04 
> libparquet.300.dylib`arrow::internal::BaseSetBitRunReader::NextRun() + 
> 192
> frame #1: 0x00010047f808 libparquet.300.dylib`parquet::(anonymous 
> namespace)::PlainEncoder 
> >::PutSpaced(bool const*, int, unsigned char const*, long long) + 336
> frame #2: 0x00018970 
> parquet-encoding-benchmark`parquet::BM_PlainEncodingSpacedBoolean(benchmark::State&)
>  at encoding_benchmark.cc:249:14 [opt]
> frame #3: 0x0001881c 
> parquet-encoding-benchmark`parquet::BM_PlainEncodingSpacedBoolean(state=0x00016fdfd4b8)
>  at encoding_benchmark.cc:257 [opt]
> frame #4: 0x0001001614f4 
> libbenchmark.0.dylib`benchmark::internal::BenchmarkInstance::Run(unsigned 
> long long, int, benchmark::internal::ThreadTimer*, 
> benchmark::internal::ThreadManager*) const + 68
> frame #5: 0x000100173ae8 
> libbenchmark.0.dylib`benchmark::internal::(anonymous 
> namespace)::RunInThread(benchmark::internal::BenchmarkInstance const*, 
> unsigned long long, int, benchmark::internal::ThreadManager*) + 80
> frame #6: 0x0001001723c8 
> libbenchmark.0.dylib`benchmark::internal::RunBenchmark(benchmark::internal::BenchmarkInstance
>  const&, std::__1::vector std::__1::allocator >*) + 1284
> frame #7: 0x00010015ee7c 
> libbenchmark.0.dylib`benchmark::RunSpecifiedBenchmarks(benchmark::BenchmarkReporter*,
>  benchmark::BenchmarkReporter*) + 1824
> frame #8: 0x00010014beec libbenchmark_main.0.dylib`main + 76
> frame #9: 0x00019e270f54 libdyld.dylib`start + 4
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10881) [C++] EXC_BAD_ACCESS in BaseSetBitRunReader::NextRun

2021-01-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10881:
---
Labels: osx-arm64 pull-request-available  (was: osx-arm64)

> [C++] EXC_BAD_ACCESS in BaseSetBitRunReader::NextRun
> ---
>
> Key: ARROW-10881
> URL: https://issues.apache.org/jira/browse/ARROW-10881
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Affects Versions: 2.0.0
>Reporter: Uwe Korn
>Assignee: Uwe Korn
>Priority: Major
>  Labels: osx-arm64, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{./release/parquet-encoding-benchmark}} fails with
> {code}
> BM_PlainDecodingFloat/65536  4206 ns 4206 
> ns   167354 bytes_per_second=58.0474G/s
> error: libparquet.300.dylib debug map object file 
> '/Users/uwe/Development/arrow/cpp/build/src/parquet/CMakeFiles/parquet_objlib.dir/encoding.cc.o'
>  has changed (actual time is 2020-12-10 22:57:29.0, debug map time is 
> 2020-12-10 21:02:52.0) since this executable was linked, file will be 
> ignored
> Process 11120 stopped
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
> (code=1, address=0x0)
> frame #0: 0x00010047fe04 
> libparquet.300.dylib`arrow::internal::BaseSetBitRunReader::NextRun() + 
> 192
> libparquet.300.dylib`arrow::internal::BaseSetBitRunReader::NextRun:
> ->  0x10047fe04 <+192>: ldur   x11, [x9, #-0x8]
> 0x10047fe08 <+196>: str    x9, [x19]
> 0x10047fe0c <+200>: str    x11, [x19, #0x18]
> 0x10047fe10 <+204>: rbit   x10, x11
> Target 0: (parquet-encoding-benchmark) stopped.
> (lldb) bt
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
> (code=1, address=0x0)
>   * frame #0: 0x00010047fe04 
> libparquet.300.dylib`arrow::internal::BaseSetBitRunReader::NextRun() + 
> 192
> frame #1: 0x00010047f808 libparquet.300.dylib`parquet::(anonymous 
> namespace)::PlainEncoder 
> >::PutSpaced(bool const*, int, unsigned char const*, long long) + 336
> frame #2: 0x00018970 
> parquet-encoding-benchmark`parquet::BM_PlainEncodingSpacedBoolean(benchmark::State&)
>  at encoding_benchmark.cc:249:14 [opt]
> frame #3: 0x0001881c 
> parquet-encoding-benchmark`parquet::BM_PlainEncodingSpacedBoolean(state=0x00016fdfd4b8)
>  at encoding_benchmark.cc:257 [opt]
> frame #4: 0x0001001614f4 
> libbenchmark.0.dylib`benchmark::internal::BenchmarkInstance::Run(unsigned 
> long long, int, benchmark::internal::ThreadTimer*, 
> benchmark::internal::ThreadManager*) const + 68
> frame #5: 0x000100173ae8 
> libbenchmark.0.dylib`benchmark::internal::(anonymous 
> namespace)::RunInThread(benchmark::internal::BenchmarkInstance const*, 
> unsigned long long, int, benchmark::internal::ThreadManager*) + 80
> frame #6: 0x0001001723c8 
> libbenchmark.0.dylib`benchmark::internal::RunBenchmark(benchmark::internal::BenchmarkInstance
>  const&, std::__1::vector std::__1::allocator >*) + 1284
> frame #7: 0x00010015ee7c 
> libbenchmark.0.dylib`benchmark::RunSpecifiedBenchmarks(benchmark::BenchmarkReporter*,
>  benchmark::BenchmarkReporter*) + 1824
> frame #8: 0x00010014beec libbenchmark_main.0.dylib`main + 76
> frame #9: 0x00019e270f54 libdyld.dylib`start + 4
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9612) [Python] Automatically fall back on larger IO block size when JSON parsing fails

2021-01-04 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pere-Lluís Huguet Cabot updated ARROW-9612:
---
Attachment: wiki_04.jsonl

> [Python] Automatically fall back on larger IO block size when JSON parsing fails
> ---
>
> Key: ARROW-9612
> URL: https://issues.apache.org/jira/browse/ARROW-9612
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: wiki_04.jsonl
>
>
> From GitHub issue
> https://github.com/apache/arrow/issues/7835
> This seems like a less-than-ideal failure mode; perhaps when this occurs it 
> could automatically fall back to processing the file as a single block?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9612) [Python] Automatically fall back on larger IO block size when JSON parsing fails

2021-01-04 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258501#comment-17258501
 ] 

Pere-Lluís Huguet Cabot commented on ARROW-9612:


I have attached a file that triggers the same error. It is a dump of Wikipedia 
abstracts with Wikidata information.

> [Python] Automatically fall back on larger IO block size when JSON parsing fails
> ---
>
> Key: ARROW-9612
> URL: https://issues.apache.org/jira/browse/ARROW-9612
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: wiki_04.jsonl
>
>
> From GitHub issue
> https://github.com/apache/arrow/issues/7835
> This seems like a less-than-ideal failure mode; perhaps when this occurs it 
> could automatically fall back to processing the file as a single block?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11096) [Rust] Add FFI for [Large]Binary

2021-01-04 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-11096.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 9065
[https://github.com/apache/arrow/pull/9065]

> [Rust] Add FFI for [Large]Binary
> 
>
> Key: ARROW-11096
> URL: https://issues.apache.org/jira/browse/ARROW-11096
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11127) [C++] Unused cpu_info on non-x86 architecture

2021-01-04 Thread Uwe Korn (Jira)
Uwe Korn created ARROW-11127:


 Summary: [C++] Unused cpu_info on non-x86 architecture
 Key: ARROW-11127
 URL: https://issues.apache.org/jira/browse/ARROW-11127
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 2.0.0
Reporter: Uwe Korn
Assignee: Uwe Korn






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7878) [C++] Implement LogicalPlan and LogicalPlanBuilder

2021-01-04 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-7878:
--

Assignee: (was: Francois Saint-Jacques)

> [C++] Implement LogicalPlan and LogicalPlanBuilder
> --
>
> Key: ARROW-7878
> URL: https://issues.apache.org/jira/browse/ARROW-7878
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 0.17.0
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 18h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7878) [C++] Implement LogicalPlan and LogicalPlanBuilder

2021-01-04 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-7878:
---
Fix Version/s: (was: 3.0.0)

> [C++] Implement LogicalPlan and LogicalPlanBuilder
> --
>
> Key: ARROW-7878
> URL: https://issues.apache.org/jira/browse/ARROW-7878
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 0.17.0
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 18h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11127) [C++] Unused cpu_info on non-x86 architecture

2021-01-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11127:
---
Labels: pull-request-available  (was: )

> [C++] Unused cpu_info on non-x86 architecture
> -
>
> Key: ARROW-11127
> URL: https://issues.apache.org/jira/browse/ARROW-11127
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 2.0.0
>Reporter: Uwe Korn
>Assignee: Uwe Korn
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11128) [Rust] Optimize CSV writer

2021-01-04 Thread Jira
Daniël Heres created ARROW-11128:


 Summary: [Rust] Optimize CSV writer
 Key: ARROW-11128
 URL: https://issues.apache.org/jira/browse/ARROW-11128
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Daniël Heres
Assignee: Daniël Heres






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-2303) [C++] Disable ASAN when building io-hdfs-test.cc

2021-01-04 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-2303:
---
Fix Version/s: (was: 3.0.0)

> [C++] Disable ASAN when building io-hdfs-test.cc
> 
>
> Key: ARROW-2303
> URL: https://issues.apache.org/jira/browse/ARROW-2303
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Priority: Minor
>
> ASAN reports spurious memory leaks in this unit test module. I am not sure of 
> the easiest way to conditionally scrub the ASAN flags from such a unit test's 
> compilation flags.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-11128) [Rust] Optimize CSV writer

2021-01-04 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-11128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniël Heres closed ARROW-11128.

Resolution: Abandoned

> [Rust] Optimize CSV writer
> --
>
> Key: ARROW-11128
> URL: https://issues.apache.org/jira/browse/ARROW-11128
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Daniël Heres
>Assignee: Daniël Heres
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7001) [C++] Develop threading APIs to accommodate nested parallelism

2021-01-04 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-7001:
--

Assignee: Weston Pace

> [C++] Develop threading APIs to accommodate nested parallelism 
> ---
>
> Key: ARROW-7001
> URL: https://issues.apache.org/jira/browse/ARROW-7001
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Weston Pace
>Priority: Major
> Fix For: 3.0.0
>
>
> Tasks invoked in parallel may be able to submit their own subtasks, which in 
> OpenMP and TBB documentation is often called "nested parallelism". 
> If a task blocks on the completion of subtasks, then outright deadlocks are 
> possible -- running tasks are all blocking on their subtasks, but the thread 
> pool will not schedule any further tasks.
> I suggest that such code have a way to indicate to the thread pool (if one is 
> passed in) that it is blocking on the completion of other tasks so that 
> further tasks can be run while the task waits for its child tasks to 
> complete. One possible way to do this is to have a floating "soft limit" for 
> concurrent tasks that can be incremented when tasks are waiting. 
> So if we normally allow 8 concurrent tasks, then this can be temporarily 
> increased for each "suspended" task. Preferably we would provide some way for 
> the dependent task group to "awaken" the suspended task so that it does not 
> have to do any work while waiting for the task group to finish.
> Note that this feature can also be used by tasks that are waiting on IO calls.
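A minimal sketch of the floating soft limit in Python, using only the standard library (hypothetical names; Arrow's version would live in the C++ ThreadPool): a task that blocks on subtasks returns its execution slot to the scheduler, so the effective concurrency floats up while tasks are suspended and the nested-parallelism deadlock is avoided.

{code:python}
import threading
from concurrent.futures import ThreadPoolExecutor, wait

HARD_LIMIT = 8                                    # normal concurrency
pool = ThreadPoolExecutor(max_workers=HARD_LIMIT * 4)
slots = threading.Semaphore(HARD_LIMIT)           # the floating soft limit

def spawn(fn, *args):
    """Submit a task that only computes while holding a slot."""
    def wrapper():
        with slots:
            return fn(*args)
    return pool.submit(wrapper)

def wait_for_subtasks(futures):
    """Signal suspension: give the slot back while blocked on children,
    then take one back before resuming."""
    slots.release()
    try:
        wait(futures)
    finally:
        slots.acquire()

def parent(i):
    children = [spawn(lambda x=j: x * x) for j in range(4)]
    # Without this, 8 parents would hold all 8 slots while their
    # children block forever waiting for a slot: deadlock.
    wait_for_subtasks(children)
    return i + sum(f.result() for f in children)

futures = [spawn(parent, i) for i in range(16)]
print(sum(f.result() for f in futures))
{code}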



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-8599) [C++][Parquet] Optional parallel processing when writing Parquet files

2021-01-04 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-8599:
--

Assignee: Weston Pace

> [C++][Parquet] Optional parallel processing when writing Parquet files
> --
>
> Key: ARROW-8599
> URL: https://issues.apache.org/jira/browse/ARROW-8599
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Weston Pace
>Priority: Major
> Fix For: 3.0.0
>
>
> If we permit encoded columns in row groups to be buffered in memory rather 
> than immediately written out to the {{OutputStream}}, then we can use 
> multiple threads for the encoding / compression of columns. Combined with a 
> separate thread to take the encoded columns and write them out to disk, this 
> should yield substantially improved file write times.
> This could be enabled through an option since it would increase memory use 
> when writing. The memory use can be somewhat constrained by limiting the size 
> of row groups.
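A rough sketch of the scheme in Python with a stand-in encoder (not the Parquet writer API): the expensive encode/compress step runs in parallel into in-memory buffers, and a single writer then streams them out in column order so the file layout is unchanged.

{code:python}
import concurrent.futures as cf
import gzip

def encode_column(name, values):
    # Stand-in for Parquet encoding + compression of one column chunk.
    payload = ",".join(map(str, values)).encode()
    return name, gzip.compress(payload)

# One row group's worth of columns.
columns = {f"col{i}": list(range(10_000)) for i in range(8)}

# Encode the columns in parallel; results are buffered in memory, so the
# extra memory use is bounded by the row-group size.
with cf.ThreadPoolExecutor() as pool:
    encoded = list(pool.map(lambda item: encode_column(*item), columns.items()))

# Sequential write preserves the on-disk column order.
with open("row_group.bin", "wb") as sink:
    for name, buffer in encoded:
        sink.write(buffer)
{code}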



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-8626) [C++] Implement "round robin" scheduler interface to fixed-size ThreadPool

2021-01-04 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-8626:
--

Assignee: Weston Pace

> [C++] Implement "round robin" scheduler interface to fixed-size ThreadPool 
> ---
>
> Key: ARROW-8626
> URL: https://issues.apache.org/jira/browse/ARROW-8626
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Weston Pace
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, tasks submitted to a thread pool are all commingled in 
> a common queue. When a new task submitter shows up, it must wait at the 
> back of the line behind all other queued tasks.
> A simple alternative would be round-robin scheduling, where each new 
> consumer is assigned a unique integer id, and the scheduler / thread pool 
> internally maintains the tasks associated with each consumer in separate 
> queues. 
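A minimal sketch of the idea in Python (hypothetical structure, not Arrow's ThreadPool interface): one FIFO per consumer, with the dispatcher cycling across consumers so a newcomer is not stuck behind every previously queued task.

{code:python}
from collections import deque

class RoundRobinScheduler:
    def __init__(self):
        self.queues = {}       # consumer id -> FIFO of tasks
        self.order = deque()   # rotation of consumer ids

    def submit(self, consumer_id, task):
        if consumer_id not in self.queues:
            self.queues[consumer_id] = deque()
            self.order.append(consumer_id)
        self.queues[consumer_id].append(task)

    def next_task(self):
        # Visit each consumer once, starting at the head of the rotation.
        for _ in range(len(self.order)):
            cid = self.order[0]
            self.order.rotate(-1)       # this consumer moves to the back
            if self.queues[cid]:
                return self.queues[cid].popleft()
        return None

s = RoundRobinScheduler()
for t in range(3):
    s.submit("old", f"old-{t}")
s.submit("new", "new-0")  # the newcomer is interleaved, not queued last
print([s.next_task() for _ in range(4)])
# -> ['old-0', 'new-0', 'old-1', 'old-2']
{code}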



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-1106) [C++] Native result set adapter for PostgreSQL / libpq

2021-01-04 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-1106:
---
Fix Version/s: (was: 3.0.0)
   4.0.0

> [C++] Native result set adapter for PostgreSQL / libpq
> --
>
> Key: ARROW-1106
> URL: https://issues.apache.org/jira/browse/ARROW-1106
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: database
> Fix For: 4.0.0
>
>
> We can look at https://github.com/MagicStack/asyncpg for inspiration



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

