> Still, I would like to see "columnar" used in the first sentence as this is
> the main focus of the project.

It's interesting: slightly de-emphasizing the role of the columnar format is actually one of my objectives for the revisions. It does not mean that the columnar specification is not a critical component of the project: it absolutely is, and it is one of the centerpieces of the project. But the scope of Arrow has already become larger than that -- as time goes on, the project's center of gravity concerns general management of in-memory analytical datasets. These may not be structured (and columnar) 100% of the time -- for example, you could use Arrow to write a collection of simple buffers (without any additional type metadata) to shared memory, then read them back with zero copy.

This requires maintaining a general "memory management system" that is necessary for everything else, and the columnar format is built on top of this. It's pretty complex to be able to manage zero-copy memory references for arbitrarily complex

I see the C++ library in 4 distinct layers, for example:

* General zero-copy memory management: Plasma, arrow::Buffer, arrow::MemoryPool, the contents of arrow::io (e.g. zero-copy io::BufferReader, etc.)
* Columnar memory format / data structures / in-memory metadata: arrow::DataType / Array
* Structured data IPC: arrays, record batches, and any other new message types (e.g. tensors)
* Columnar in-memory analytics: what we are just beginning to implement in arrow/compute

I think that telling the open source community that in-memory data problems that are not columnar are of no interest to the Arrow community would needlessly close off collaboration opportunities. It's important that a larger audience is able to consume Arrow's memory management layer and IPC tools (e.g. they can easily be used for deep learning / ML applications) and use them to create more kinds of applications architected around the mantra of zero-copy. With new architectures designed to leverage non-volatile memory on the horizon, this grows more important with each passing day.
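
To make the raw-buffer case mentioned above a bit more concrete, here is a minimal sketch of the idea using only Python's standard-library `multiprocessing.shared_memory`. It is not Arrow's API; the helper names `write_buffers` and `read_buffers` and the `(offset, length)` layout list are hypothetical and chosen only for illustration. It packs untyped byte buffers into one shared segment and reads them back as `memoryview` slices that reference the shared memory rather than copying it.

```python
from multiprocessing import shared_memory


def write_buffers(buffers):
    """Pack raw byte buffers back to back into one new shared-memory segment.

    Returns the segment and a list of (offset, length) pairs, one per buffer.
    No type metadata is stored; the layout is all a reader needs.
    """
    total = sum(len(b) for b in buffers)
    # A zero-sized segment is not allowed, so allocate at least one byte.
    shm = shared_memory.SharedMemory(create=True, size=max(total, 1))
    layout = []
    offset = 0
    for b in buffers:
        shm.buf[offset:offset + len(b)] = b
        layout.append((offset, len(b)))
        offset += len(b)
    return shm, layout


def read_buffers(name, layout):
    """Attach to an existing segment and return zero-copy views into it.

    Each view is a memoryview slice over the shared segment, so no bytes
    are copied until the caller explicitly asks for a copy.
    """
    shm = shared_memory.SharedMemory(name=name)
    views = [shm.buf[off:off + n] for off, n in layout]
    return shm, views


if __name__ == "__main__":
    writer, layout = write_buffers([b"hello", b"\x00\x01\x02"])
    reader, views = read_buffers(writer.name, layout)
    print([bytes(v) for v in views])
    # Views must be released before the segment they point into is closed.
    for v in views:
        v.release()
    reader.close()
    writer.close()
    writer.unlink()
```

In this sketch the reader would normally be a different process that receives the segment name and layout out of band. Tracking which views are still alive before the segment can be closed is a small instance of the memory-management problem described above.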

- Wes

On Sun, Oct 22, 2017 at 7:32 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
> Thank you Wes and Julian for taking the approach to improve the elevator
> pitch. I really like the improvements. Still, I would like to see
> "columnar" used in the first sentence as this is the main focus of the
> project.
>
> Uwe
>
> On Sat, Oct 21, 2017, at 10:32 PM, Wes McKinney wrote:
>> Thanks Julian, I like the changes.
>>
>> For the last part I agree listing languages is good; we would do well
>> to include JavaScript and Ruby in that list. Hopefully the list will
>> keep growing longer!
>>
>> On Sat, Oct 21, 2017 at 4:20 PM, Julian Hyde <jh...@apache.org> wrote:
>> > Your proposed version is definitely an improvement.
>> >
>> >> "Apache Arrow is a cross-language development platform for in-memory
>> >> structured data access and analytics. It specifies a standardized
>> >> language-independent columnar memory format for flat and hierarchical
>> >> data, with support for zero-copy streaming messaging and interprocess
>> >> communication. It also provides computational libraries for efficient
>> >> in-memory analytics on modern hardware."
>> >
>> > I propose a few tweaks:
>> >
>> > Simplify sentence 1 to
>> >
>> > Apache Arrow is a cross-language development platform for in-memory
>> > data.
>> >
>> > This is easier to parse, captures the gist, and the other parts are covered
>> > in later sentences.
>> >
>> > To me, the cache-efficient format is more fundamentally important than
>> > streaming and IPC (you can build the latter). Therefore I’d change
>> > sentence 2 to
>> >
>> > It specifies a standardized language-independent columnar memory
>> > format for flat and hierarchical data, organized for efficient analytic
>> > operations on modern hardware.
>> >
>> > Which leaves sentence 3 as
>> >
>> > It also provides computational libraries for zero-copy streaming
>> > messaging and interprocess communication.
>> >
>> > And add sentence 4,
>> >
>> > Languages supported include C and C++, Java, and Python.
>> >
>> > Julian
>> >
>> >> On Oct 21, 2017, at 10:58 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>> >>
>> >> I believe we would benefit from modified language to describe the
>> >> nature and scope of the Arrow project.
>> >>
>> >> Currently, our GitHub project description (and what we use in release
>> >> announcements) states:
>> >>
>> >> "Apache Arrow is a columnar in-memory analytics layer designed to
>> >> accelerate big data. It houses a set of canonical in-memory
>> >> representations of flat and hierarchical data along with multiple
>> >> language-bindings for structure manipulation. It also provides IPC and
>> >> common algorithm implementations."
>> >>
>> >> I think this could be perhaps restated in the following way:
>> >>
>> >> "Apache Arrow is a cross-language development platform for in-memory
>> >> structured data access and analytics. It specifies a standardized
>> >> language-independent columnar memory format for flat and hierarchical
>> >> data, with support for zero-copy streaming messaging and interprocess
>> >> communication. It also provides computational libraries for efficient
>> >> in-memory analytics on modern hardware."
>> >>
>> >> It is true that we have been mostly focused on hardening the details
>> >> of the Arrow format and related issues around messaging and IPC, which
>> >> are necessary for everything else we may contemplate building in the
>> >> future. Since I plan to be building a library of computational tools
>> >> in C++ for the native code community (Python, Ruby, R, etc.), I think
>> >> it would be a good idea to clearly state that building general purpose
>> >> analytics implementations (i.e. the sorts of things you find in "data
>> >> frame libraries" like pandas) is part of the mission of the project.
>> >>
>> >> Feedback on the above would be appreciated how we could do a better
>> >> job representing our past, present, and future community goals.
>> >>
>> >> Thanks
>> >> Wes
>> >