> Still, I would like to see "columnar" used in the first sentence as this is 
> the main focus of the project.

It's interesting: slightly de-emphasizing the role of the columnar
format is actually one of my objectives for the revisions. That does
not mean the columnar specification is not a critical component of
the project: it absolutely is, and it is one of the centerpieces.

But the scope of Arrow has already become larger than that. As time
goes on, the project's center of gravity is the general management of
in-memory analytical datasets. These may not be structured (and
columnar) 100% of the time -- for example, you could use Arrow to
write a collection of simple buffers (without any additional type
metadata) to shared memory, then read them back with zero copy. This
requires maintaining a general "memory management system" that is
necessary for everything else, and the columnar format is built on top
of this. It's pretty complex to be able to manage zero-copy memory
references for arbitrarily complex
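
To make the shared-memory example concrete, here is a rough sketch of
the idea. It deliberately uses only Python's standard library
(multiprocessing.shared_memory, Python 3.8+), not Arrow's or Plasma's
API, and the helper names write_buffers and read_buffers are
hypothetical, made up for illustration. Untyped byte buffers are
packed into one segment, and the reader gets memoryview slices over
that segment rather than copies.

```python
from multiprocessing import shared_memory


def write_buffers(payloads):
    """Pack raw byte buffers back to back into one new shared-memory segment.

    Returns the segment plus the (offset, length) of each buffer. No type
    metadata is stored; the reader only needs the name and the offsets.
    """
    total = sum(len(p) for p in payloads)
    shm = shared_memory.SharedMemory(create=True, size=max(total, 1))
    layout, pos = [], 0
    for p in payloads:
        shm.buf[pos:pos + len(p)] = p  # the only copy: into shared memory
        layout.append((pos, len(p)))
        pos += len(p)
    return shm, layout


def read_buffers(name, layout):
    """Attach to an existing segment and return zero-copy views into it."""
    shm = shared_memory.SharedMemory(name=name)
    # Slicing a memoryview yields another view over the same memory.
    views = [shm.buf[off:off + n] for off, n in layout]
    return shm, views


if __name__ == "__main__":
    writer, layout = write_buffers([b"abc", b"\x00\x01\x02\x03"])
    reader, views = read_buffers(writer.name, layout)
    print([bytes(v) for v in views])
    for v in views:
        v.release()  # views must be released before the segment is closed
    reader.close()
    writer.close()
    writer.unlink()
```

Even in this toy version the lifetime bookkeeping is visible: every
view has to be released before the segment can be closed, which is a
small taste of why a real memory management layer is needed.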

I see the C++ library in 4 distinct layers, for example:

* General zero-copy memory management: Plasma, arrow::Buffer,
arrow::MemoryPool, the contents of arrow::io (e.g. zero-copy
io::BufferReader, etc.)
* Columnar memory format / data structures / in-memory metadata:
arrow::DataType / Array
* Structured data IPC: arrays, record batches, and any other new
message types (e.g. tensors)
* Columnar in-memory analytics: what we are just beginning to
implement in arrow/compute
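
As an illustration of how the second layer sits on top of the first,
here is a simplified sketch of a nullable int32 column expressed as
two plain byte buffers: a validity bitmap (one bit per slot, least
significant bit first) and a little-endian values buffer. This is
only in the spirit of the columnar format; it skips the padding and
alignment the real specification asks for, and the function names
build_int32_column and get_value are hypothetical.

```python
import struct


def build_int32_column(values):
    """Encode a list of int-or-None as (validity bitmap, values buffer)."""
    n = len(values)
    validity = bytearray((n + 7) // 8)  # one bit per slot, LSB first
    data = bytearray(4 * n)             # 4 bytes per slot, nulls left as zero
    for i, v in enumerate(values):
        if v is not None:
            validity[i // 8] |= 1 << (i % 8)
            struct.pack_into("<i", data, 4 * i, v)
    return validity, data


def get_value(validity, data, i):
    """Read slot i directly from the buffers; None if the slot is null."""
    if not (validity[i // 8] >> (i % 8)) & 1:
        return None
    return struct.unpack_from("<i", data, 4 * i)[0]


if __name__ == "__main__":
    validity, data = build_int32_column([1, None, -3])
    print([get_value(validity, data, i) for i in range(3)])
```

The point is that the column is nothing more than metadata (type,
length) plus references to untyped buffers, so whatever manages the
buffers can be used without the columnar layer on top.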

I think that telling the open source community that in-memory data
problems that are not columnar are of no interest to the Arrow
community would needlessly close off collaboration opportunities.
It's important that a larger audience is able to consume Arrow's
memory management layer and IPC tools (e.g. they can easily be used
for deep learning / ML applications) and use them to create more kinds
of applications architected around the mantra of zero-copy. With new
architectures designed to leverage non-volatile memory on the horizon,
this grows more important with each passing day.

- Wes

On Sun, Oct 22, 2017 at 7:32 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
> Thank you Wes and Julian for taking the approach to improve the elevator
> pitch. I really like the improvements. Still, I would like to see
> "columnar" used in the first sentence as this is the main focus of the
> project.
>
> Uwe
>
> On Sat, Oct 21, 2017, at 10:32 PM, Wes McKinney wrote:
>> Thanks Julian, I like the changes.
>>
>> For the last part I agree listing languages is good; we would do well
>> to include JavaScript and Ruby in that list. Hopefully the list will
>> keep growing longer!
>>
>> On Sat, Oct 21, 2017 at 4:20 PM, Julian Hyde <jh...@apache.org> wrote:
>> > Your proposed version is definitely an improvement.
>> >
>> >> "Apache Arrow is a cross-language development platform for in-memory
>> >> structured data access and analytics. It specifies a standardized
>> >> language-independent columnar memory format for flat and hierarchical
>> >> data, with support for zero-copy streaming messaging and interprocess
>> >> communication. It also provides computational libraries for efficient
>> >> in-memory analytics on modern hardware."
>> >
>> > I propose a few tweaks:
>> >
>> > Simplify sentence 1 to
>> >
>> >   Apache Arrow is a cross-language development platform for in-memory
>> >   data.
>> >
>> > This is easier to parse, captures the gist, and the other parts are covered
>> > in later sentences.
>> >
>> > To me, the cache-efficient format is more fundamentally important than
>> > streaming and IPC (you can build the latter). Therefore I’d change
>> > sentence 2 to
>> >
>> >   It specifies a standardized language-independent columnar memory
>> >   format for flat and hierarchical data, organized for efficient analytic
>> >   operations on modern hardware.
>> >
>> > Which leaves sentence 3 as
>> >
>> >   It also provides computational libraries for zero-copy streaming
>> >   messaging and interprocess communication.
>> >
>> > And add sentence 4,
>> >
>> >   Languages supported include C and C++, Java, and Python.
>> >
>> > Julian
>> >
>> >> On Oct 21, 2017, at 10:58 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>> >>
>> >> I believe we would benefit from modified language to describe the
>> >> nature and scope of the Arrow project.
>> >>
>> >> Currently, our GitHub project description (and what we use in release
>> >> announcements) states:
>> >>
>> >> "Apache Arrow is a columnar in-memory analytics layer designed to
>> >> accelerate big data. It houses a set of canonical in-memory
>> >> representations of flat and hierarchical data along with multiple
>> >> language-bindings for structure manipulation. It also provides IPC and
>> >> common algorithm implementations."
>> >>
>> >> I think this could be perhaps restated in the following way:
>> >>
>> >> "Apache Arrow is a cross-language development platform for in-memory
>> >> structured data access and analytics. It specifies a standardized
>> >> language-independent columnar memory format for flat and hierarchical
>> >> data, with support for zero-copy streaming messaging and interprocess
>> >> communication. It also provides computational libraries for efficient
>> >> in-memory analytics on modern hardware."
>> >>
>> >> It is true that we have been mostly focused on hardening the details
>> >> of the Arrow format and related issues around messaging and IPC, which
>> >> are necessary for everything else we may contemplate building in the
>> >> future. Since I plan to be building a library of computational tools
>> >> in C++ for the native code community (Python, Ruby, R, etc.), I think
>> >> it would be a good idea to clearly state that building general purpose
>> >> analytics implementations (i.e. the sorts of things you find in "data
>> >> frame libraries" like pandas) is part of the mission of the project.
>> >>
>> >> Feedback on the above would be appreciated, as would ideas on how
>> >> we could do a better job representing our past, present, and future
>> >> community goals.
>> >>
>> >> Thanks
>> >> Wes
>> >
