Re: [DISCUSS] Updating Arrow's "elevator pitch" on web properties

Wes McKinney Fri, 27 Oct 2017 11:48:54 -0700

Here's a tweaked version of Julian's edits in 4 bullet points

1. Apache Arrow is a cross-language development platform for in-memory
   data.
2. It specifies a standardized language-independent columnar memory format for
   flat and hierarchical data, organized for efficient analytic operations on
   modern hardware.
3. It also provides computational libraries and zero-copy streaming messaging
   and interprocess communication.
4. Languages currently supported include C, C++, Java, JavaScript,
   Python, and Ruby.


My comments to these points:

1. Arrow's scope as a "hub" for in-memory data is larger than the
columnar format. I think leading with "columnar in-memory analytics"
would weaken the project's position for users who do not exclusively
work with columnar data; avoiding that phrasing may also limit the
number of people who jump to the immediate conclusion that Arrow "is
the same as Parquet". We obviously need to have a FAQ on the website
where we address such confusions more directly.

2. The columnar format specification is one of the keystones of the project

3. We are building computation and messaging libraries to be
companions to the columnar format and memory management

4. We support many languages (I added "currently" to imply that we are
not closed to new languages)

- Wes

On Sun, Oct 22, 2017 at 11:04 PM, Julian Hyde <jh...@apache.org> wrote:
> It's best if a project's (or company's) marketing has several tiers:
> an "elevator pitch" of 2-3 sentences, a "high concept pitch" which is
> a phrase, e.g. "book rooms with locals, rather than hotels", and an
> expanded description.
>
> I think the question of whether this replaces Avro is best handled in an FAQ.
>
> On Sun, Oct 22, 2017 at 5:35 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>>> But my concern is that I saw some time ago some people questioning "Is
>>> Arrow a replacement for Avro?" (also Flatbuffers seems to be something we
>>> get often compared to). For at least these two cases, I see that we want
>>> to achieve different goals. We want to work with them together to
>>> build a better data analytics ecosystem but at least from my
>>> perspective, we don't want to replace all existing serialization
>>> formats.
>>
>> Indeed, the most common problem I have experienced is that people who
>> do not build data processing engines professionally sometimes get
>> confused about the distinction between in-memory formats and
>> serialization formats (Parquet, Avro, Protocol Buffers, etc.). The
>> vast majority of developers rarely get this "close to the metal" and
>> mainly think about storage formats and data access layers in terms of
>> their high level semantics like "tables" and "records".
>>
>> The distinction between Arrow and zero-copy serialization formats like
>> Flatbuffers and Cap'n Proto is another thing that I often find myself
>> explaining. I don't think there's any way we can resolve these
>> confusions in ~100 words.
>>
>> I would like for us to write some blog posts helping people mentally
>> classify the technologies since it would help people understand both
>> how Arrow is different as well as how it is a complementary / not
>> mutually exclusive technology. I find that programmers are sometimes
>> prone to dichotomous / binary thinking (which leads to the inclination
>> to cast one technology as "the same as" another) and it's rare that a
>> new, category-defining technology like this comes along. People even
>> hear the "columnar" buzzword and then ask "wait, so is this replacing
>> Parquet?".
>>
>> The audience for the Arrow project is the developers of data
>> processing engines. We need to precisely message that developers who
>> work with complex in-memory data sets (especially using shared memory
>> and memory-mappable devices like GPUs and NVM), even if they are not
>> always columnar / structured, are welcome and indeed desired members
>> of our community. As an example, our collaboration with the Ray
>> project has been a success (and bodes well for use in more machine
>> learning applications) because we can compose our zero-copy structured
>> data representation with general buffer memory management to create
>> richer, memory-efficient data access interfaces.
>>
>> I'll spend a little time tweaking the blurb a bit based on Julian's
>> edits and post for more feedback.
>>
>> - Wes
>>
>> On Sun, Oct 22, 2017 at 8:01 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
>>> I clearly understand that all four layers are important to Arrow (and we
>>> should mention them, maybe graphically) on the Arrow landing page. But
>>> my concern is that I saw some time ago some people questioning "Is Arrow
>>> a replacement for Avro?" (also Flatbuffers seems to be something we get
>>> often compared to). For at least these two cases, I see that we want to
>>> achieve different goals. We want to work with them together to build a
>>> better data analytics ecosystem but at least from my perspective, we
>>> don't want to replace all existing serialization formats. One of the
>>> main points that should show people that there is a boundary in Arrow's
>>> scope is the "in-memory" objective, but I would still like to keep
>>> "columnar" somewhere in the description. It might be slightly
>>> de-emphasized but it is still there as one of the focal points. From my
>>> perspective, 3 of the four layers are still very much focused on
>>> columnar memory.
>>>
>>> Uwe
>>>
>>> On Sun, Oct 22, 2017, at 01:46 PM, Wes McKinney wrote:
>>>> > Still, I would like to see "columnar" used in the first sentence as this 
>>>> > is the main focus of the project.
>>>>
>>>> It's interesting: slightly de-emphasizing the role of the columnar
>>>> format is actually one of my objectives for the revisions. It does not
>>>> mean that the columnar specification is not a critical component of
>>>> the project: it absolutely is, and it is one of the project's
>>>> centerpieces.
>>>>
>>>> But the scope of Arrow has already become larger than that -- as time
>>>> goes on the project's center of gravity concerns general management of
>>>> in-memory analytical datasets. These may not be structured (and
>>>> columnar) 100% of the time -- for example, you could use Arrow to
>>>> write a collection of simple buffers (without any additional type
>>>> metadata) to shared memory, then read them back with zero copy. This
>>>> requires maintaining a general "memory management system" that is
>>>> necessary for everything else, and the columnar format is built on top
>>>> of this. It's pretty complex to be able to manage zero-copy memory
>>>> references for arbitrarily complex
>>>>
>>>> I see the C++ library in 4 distinct layers, for example:
>>>>
>>>> * General zero-copy memory management: Plasma, arrow::Buffer,
>>>> arrow::MemoryPool, the contents of arrow::io (e.g. zero-copy
>>>> io::BufferReader, etc.)
>>>> * Columnar memory format / data structures / in-memory metadata :
>>>> arrow::DataType / Array
>>>> * Structured data IPC: arrays, record batches, and any other new
>>>> message types (e.g. tensors)
>>>> * Columnar in-memory analytics: what we are just beginning to
>>>> implement in arrow/compute
>>>>
>>>> I think to express to the open source community that in-memory data
>>>> problems that are not columnar are of no interest to the Arrow
>>>> community would be needlessly closing off collaboration opportunities.
>>>> It's important that a larger audience is able to consume Arrow's
>>>> memory management layer and IPC tools (e.g. they can easily be used
>>>> for deep learning / ML applications) and use them to create more kinds
>>>> of applications architected around the mantra of zero-copy. With new
>>>> architectures designed to leverage non-volatile memory on the horizon,
>>>> this grows more important with each passing day.
>>>>
>>>> - Wes
>>>>
>>>> On Sun, Oct 22, 2017 at 7:32 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
>>>> > Thank you Wes and Julian for taking the approach to improve the elevator
>>>> > pitch. I really like the improvements. Still, I would like to see
>>>> > "columnar" used in the first sentence as this is the main focus of the
>>>> > project.
>>>> >
>>>> > Uwe
>>>> >
>>>> > On Sat, Oct 21, 2017, at 10:32 PM, Wes McKinney wrote:
>>>> >> Thanks Julian, I like the changes.
>>>> >>
>>>> >> For the last part I agree listing languages is good; we would do well
>>>> >> to include JavaScript and Ruby in that list. Hopefully the list will
>>>> >> keep growing longer!
>>>> >>
>>>> >> On Sat, Oct 21, 2017 at 4:20 PM, Julian Hyde <jh...@apache.org> wrote:
>>>> >> > Your proposed version is definitely an improvement.
>>>> >> >
>>>> >> >> "Apache Arrow is a cross-language development platform for in-memory
>>>> >> >> structured data access and analytics. It specifies a standardized
>>>> >> >> language-independent columnar memory format for flat and hierarchical
>>>> >> >> data, with support for zero-copy streaming messaging and interprocess
>>>> >> >> communication. It also provides computational libraries for efficient
>>>> >> >> in-memory analytics on modern hardware."
>>>> >> >
>>>> >> > I propose a few tweaks:
>>>> >> >
>>>> >> > Simplify sentence 1 to
>>>> >> >
>>>> >> >   Apache Arrow is a cross-language development platform for in-memory
>>>> >> >   data.
>>>> >> >
>>>> >> > This is easier to parse, captures the gist, and the other parts are
>>>> >> > covered in later sentences.
>>>> >> >
>>>> >> > To me, the cache-efficient format is more fundamentally important than
>>>> >> > streaming and IPC (you can build the latter). Therefore I’d change
>>>> >> > sentence 2 to
>>>> >> >
>>>> >> >   It specifies a standardized language-independent columnar memory
>>>> >> >   format for flat and hierarchical data, organized for efficient
>>>> >> >   analytic operations on modern hardware.
>>>> >> >
>>>> >> > Which leaves sentence 3 as
>>>> >> >
>>>> >> >   It also provides computational libraries for zero-copy streaming
>>>> >> >   messaging and interprocess communication.
>>>> >> >
>>>> >> > And add sentence 4,
>>>> >> >
>>>> >> >   Languages supported include C and C++, Java, and Python.
>>>> >> >
>>>> >> > Julian
>>>> >> >
>>>> >> >> On Oct 21, 2017, at 10:58 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>> >> >>
>>>> >> >> I believe we would benefit from modified language to describe the
>>>> >> >> nature and scope of the Arrow project.
>>>> >> >>
>>>> >> >> Currently, our GitHub project description (and what we use in release
>>>> >> >> announcements) states:
>>>> >> >>
>>>> >> >> "Apache Arrow is a columnar in-memory analytics layer designed to
>>>> >> >> accelerate big data. It houses a set of canonical in-memory
>>>> >> >> representations of flat and hierarchical data along with multiple
>>>> >> >> language-bindings for structure manipulation. It also provides IPC
>>>> >> >> and common algorithm implementations."
>>>> >> >>
>>>> >> >> I think this could be perhaps restated in the following way:
>>>> >> >>
>>>> >> >> "Apache Arrow is a cross-language development platform for in-memory
>>>> >> >> structured data access and analytics. It specifies a standardized
>>>> >> >> language-independent columnar memory format for flat and hierarchical
>>>> >> >> data, with support for zero-copy streaming messaging and interprocess
>>>> >> >> communication. It also provides computational libraries for efficient
>>>> >> >> in-memory analytics on modern hardware."
>>>> >> >>
>>>> >> >> It is true that we have been mostly focused on hardening the details
>>>> >> >> of the Arrow format and related issues around messaging and IPC,
>>>> >> >> which are necessary for everything else we may contemplate building
>>>> >> >> in the future. Since I plan to be building a library of computational
>>>> >> >> tools in C++ for the native code community (Python, Ruby, R, etc.), I
>>>> >> >> think it would be a good idea to clearly state that building general
>>>> >> >> purpose analytics implementations (i.e. the sorts of things you find
>>>> >> >> in "data frame libraries" like pandas) is part of the mission of the
>>>> >> >> project.
>>>> >> >>
>>>> >> >> Feedback on the above, and on how we could do a better job
>>>> >> >> representing our past, present, and future community goals, would
>>>> >> >> be appreciated.
>>>> >> >>
>>>> >> >> Thanks
>>>> >> >> Wes
>>>> >> >
