Re: Long title on github page

Dominik Moritz Thu, 10 Jun 2021 10:37:38 -0700

I thought there were some good suggestions in this thread. @Wes, did you
find a description you liked?


On May 18, 2021 at 06:24:47, Adam Hooper <[email protected]> wrote:

> Poll question: why did you choose Arrow?
>
> Personally: I researched Arrow because it's a spec for IPC. (My requirement
> was: "wrap computations in a separate process.") I chose Arrow for its
> community and ecosystem -- in other words, because my peers chose it.
>
> I happen to use the compute kernel and Parquet capabilities every day; but
> they did not sway me at all. I would choose Arrow if it were nothing but
> this spec and this community. (I chose HTML, after all.)
>
> I see the *code* as one enormous proof that the *spec* is good, and as a
> collection of examples and best practices.
>
> ... so a great pitch to me would be: "Apache Arrow is a data format and
> toolbox for efficient in-memory processing."
>
> Enjoy life,
> Adam
>
> On Tue, May 18, 2021 at 2:38 AM Aldrin <[email protected]> wrote:
>
> "Apache Arrow is a data processing library that also provides a uniform,
> efficient interface for data systems."
>
> This probably still isn't quite right; I imagine the bit about "for data
> systems" needs some addition (maybe "for transport between data systems")?
>
> My primary motivators:
>
>    - "A data processing library":
>       - Arrow provides many language bindings, but ultimately they're all
>         part of the same "library ecosystem", which I think is fine to
>         capture in "library"
>       - A main goal of Arrow is for processing to be fast, whatever that
>         processing may be
>    - "uniform, efficient interface for data systems":
>       - Arrow provides (or tries to) a cohesive ("uniform") interface for
>         data processing (although it has several APIs to do this)
>       - Also, IMO, a motivation for Arrow was a format and library to
>         facilitate processing, but that provided functions and interfaces
>         to easily translate into optimized data formats used by disparate
>         data systems (Cassandra, Hadoop, etc.).
>       - Arrow tries to be transparently zero-copy, which is part of the
>         interface for efficiency
>    - Arrow certainly has a data format, but that format is the crux of the
>      interface (IMO). However, it also makes using other formats easy (via
>      filesystem API and Parquet reader/writers, etc.). So, focusing on the
>      data format seems unnecessary in such a terse description.
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
>
> On Mon, May 17, 2021 at 5:07 PM Weston Pace <[email protected]> wrote:
>
> > I'd avoid the word "structured" as it is somewhat ill-defined.
> >
> > On Mon, May 17, 2021 at 12:37 PM Mauricio Vargas
> > <[email protected]> wrote:
> > >
> > > more marketed:
> > > How about: "Apache Arrow is a format and language-agnostic library
> > > focused on efficient sharing and processing of structured data."
> > >
> > > On Mon, May 17, 2021 at 6:25 PM Micah Kornfield <[email protected]>
> > > wrote:
> > >
> > > > How about: "Apache Arrow is a collection of specifications, cross
> > > > language libraries and applications focused on efficient sharing and
> > > > processing of structured data."
> > > >
> > > > On Mon, May 17, 2021 at 3:06 PM Wes McKinney <[email protected]> wrote:
> > > >
> > > > > On Mon, May 17, 2021 at 4:58 PM Weston Pace <[email protected]>
> > > > > wrote:
> > > > > >
> > > > > > > “Apache Arrow is a format and compute kernel for in-memory data”
> > > > > >
> > > > > > I like this but no one ever knows what "in-memory" means (or they
> > > > > > just think 'data is always in memory').  How about...
> > > > > >
> > > > > > "Apache Arrow is a format and compute kernel for zero-copy
> > > > > > processing and sharing of data."
> > > > > >
> > > > > > or...
> > > > > >
> > > > > > "Apache Arrow is a format and compute kernel for processing and
> > > > > > sharing data without serialization overhead."
> > > > >
> > > > > A few issues with this:
> > > > >
> > > > > * Multiple PL aspect unclear (is it a single piece of software, or
> > > > > multiple pieces of software?)
> > > > > * Development platform aspect unclear
> > > > >
> > > > > I see that some people don't like the word "platform". Some people
> > > > > come to this project and want to find an end-to-end application,
> > > > > rather than a developer toolkit that they can use to build
> > > > > applications. Perhaps we should be more explicit and use
> > > > > "computational development toolkit" instead of "platform".
> > > > >
> > > > > > Although marshalling[1] would probably be a more precise word it
> > > > > > is not as well known.
> > > > > >
> > > > > > [1] https://en.wikipedia.org/wiki/Marshalling_(computer_science)
> > > > > >
> > > > > > On Mon, May 17, 2021 at 9:36 AM Mauricio Vargas
> > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > a few ideas
> > > > > > >
> > > > > > > github.com/apache/arrow - Apache Arrow is an efficient library for
> > > > > > > big data processing and sharing
> > > > > > >
> > > > > > > github.com/apache/arrow - Apache Arrow is a computational tool for
> > > > > > > processing, storing and sharing large datasets
> > > > > > >
> > > > > > > github.com/apache/arrow - Apache Arrow is a fast and simple library
> > > > > > > for big data analytics
> > > > > > >
> > > > > > > *github.com/apache/arrow <http://github.com/apache/arrow> - Apache
> > > > > > > Arrow is a powerful workhorse for analytic operations on modern
> > > > > > > hardware*
> > > > > > >
> > > > > > > On Mon, May 17, 2021 at 3:13 PM Julian Hyde <[email protected]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Alright, well, whatever it is, it must fit into one breath. If
> > > > > > > > the high-concept pitch is successful, people will stick around
> > > > > > > > for the full pitch.
> > > > > > > >
> > > > > > > > Words such as “platform” and “enable” are noise. You say
> > > > > > > > “platform”, they start to say “what exactly do you mean by
> > > > > > > > platform”, the elevator doors open, and they’re gone.
> > > > > > > >
> > > > > > > > “Apache Arrow is a format and compute kernel for in-memory data”
> > > > > > > >
> > > > > > > > > On May 17, 2021, at 12:03 PM, Eduardo Ponce <[email protected]>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > One more suggestion for the bucket:
> > > > > > > > > "Apache Arrow is a computational platform for efficient
> > > > > > > > > in-memory data representation and processing."
> > > > > > > > >
> > > > > > > > > On Mon, May 17, 2021 at 2:49 PM Wes McKinney <[email protected]>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > >> I think less is better in the description, but unfortunately
> > > > > > > > > >> the association of Arrow as being "just a data format" has
> > > > > > > > > >> been actively harmful in some ways to community growth. We
> > > > > > > > > >> have a data format, yes, but we are also creating a
> > > > > > > > > >> computational platform to go hand-in-hand with the data
> > > > > > > > > >> format to make it easier to build fast applications that use
> > > > > > > > > >> the data format. So the description needs to capture both of
> > > > > > > > > >> these ideas.
> > > > > > > > > >>
> > > > > > > > > >> On Mon, May 17, 2021 at 12:15 PM Julian Hyde <[email protected]>
> > > > > > > > > >> wrote:
> > > > > > > > > >>>
> > > > > > > > > >>> I think that the “cross-language development platform for” is
> > > > > > > > > >>> noise. (I’m sure that JPEG developers think that JPEG is a
> > > > > > > > > >>> “cross-language development platform” too. But it isn’t. It
> > > > > > > > > >>> is an image format.)
> > > > > > > > > >>>
> > > > > > > > > >>> “Apache Arrow is a data format for efficient in-memory
> > > > > > > > > >>> processing.”
> > > > > > > > > >>>
> > > > > > > > > >>> I’ll note that in marketing speak, we are developing a
> > > > > > > > > >>> high-concept pitch [1] here. Every company needs a name, a
> > > > > > > > > >>> brand, a high-concept pitch, and a 3- or 4-sentence
> > > > > > > > > >>> description. But every Apache project needs these too. It’s
> > > > > > > > > >>> worth spending the time on the description, also, and then
> > > > > > > > > >>> use them in all the places that we describe Arrow.
> > > > > > > > > >>>
> > > > > > > > > >>> Julian
> > > > > > > > > >>>
> > > > > > > > > >>> [1] https://www.growthink.com/content/whats-your-high-concept-pitch
> > > > > > > > > >>>
> > > > > > > > > >>>> On May 17, 2021, at 7:38 AM, Eduardo Ponce <[email protected]>
> > > > > > > > > >>>> wrote:
> > > > > > > > > >>>>
> > > > > > > > > >>>> I agree with Nate's and Brian's suggestions, but would like
> > > > > > > > > >>>> to add that we can make it a one-liner for more conciseness
> > > > > > > > > >>>> and consistency with other Apache projects.
> > > > > > > > > >>>> Apologies if it seems I am going around the suggestions loop
> > > > > > > > > >>>> again.
> > > > > > > > > >>>>
> > > > > > > > > >>>> "Apache Arrow is a cross-language development platform
> > > > > > > > > >>>> enabling efficient in-memory data processing and transport."
> > > > > > > > > >>>>
> > > > > > > > > >>>> On Mon, May 17, 2021 at 10:11 AM Brian Hulette <[email protected]>
> > > > > > > > > >>>> wrote:
> > > > > > > > > >>>>
> > > > > > > > > >>>>> Thank you for bringing this up, Dominik. I sampled some of
> > > > > > > > > >>>>> the descriptions for other Apache projects I frequent; the
> > > > > > > > > >>>>> ones with a meaningful description have a single sentence:
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> github.com/apache/spark - Apache Spark - A unified analytics
> > > > > > > > > >>>>> engine for large-scale data processing
> > > > > > > > > >>>>> github.com/apache/beam - Apache Beam is a unified programming
> > > > > > > > > >>>>> model for Batch and Streaming
> > > > > > > > > >>>>> github.com/apache/avro - Apache Avro is a data serialization
> > > > > > > > > >>>>> system
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> Several others (Flink, Hadoop, ...) just have "[Mirror of]
> > > > > > > > > >>>>> Apache <name>" as the description.
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> +1 for Nate's suggestion "Apache Arrow is a cross-language
> > > > > > > > > >>>>> development platform for in-memory data. It enables systems
> > > > > > > > > >>>>> to process and transport data more efficiently."
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> On Mon, May 17, 2021 at 5:23 AM Wes McKinney <[email protected]>
> > > > > > > > > >>>>> wrote:
> > > > > > > > > >>>>>
> > > > > > > > > >>>>>> It's probably best for the description to limit mentions of
> > > > > > > > > >>>>>> specific features. There are some high level features
> > > > > > > > > >>>>>> mentioned in the description now ("computational libraries
> > > > > > > > > >>>>>> and zero-copy streaming messaging and interprocess
> > > > > > > > > >>>>>> communication"), but now in 2021 since the project has grown
> > > > > > > > > >>>>>> so much, it could leave people with a limited view of what
> > > > > > > > > >>>>>> they might find here.
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> On Mon, May 17, 2021 at 12:14 AM Mauricio Vargas
> > > > > > > > > >>>>>> <[email protected]> wrote:
> > > > > > > > > >>>>>>>
> > > > > > > > > >>>>>>> How about
> > > > > > > > > >>>>>>> 'Apache Arrow is a cross-language development platform for
> > > > > > > > > >>>>>>> in-memory data. It enables systems to process and transport
> > > > > > > > > >>>>>>> data efficiently, providing a simple and fast library for
> > > > > > > > > >>>>>>> partitioning of large tables'?
> > > > > > > > > >>>>>>>
> > > > > > > > > >>>>>>> Sorry for the delay, long election day
> > > > > > > > > >>>>>>>
> > > > > > > > > >>>>>>> On Sun, May 16, 2021, 2:27 PM Nate Bauernfeind <[email protected]>
> > > > > > > > > >>>>>>> wrote:
> > > > > > > > > >>>>>>>
> > > > > > > > > >>>>>>>> Suggestion: faster -> more efficiently
> > > > > > > > > >>>>>>>>
> > > > > > > > > >>>>>>>> "Apache Arrow is a cross-language development platform for
> > > > > > > > > >>>>>>>> in-memory data. It enables systems to process and transport
> > > > > > > > > >>>>>>>> data more efficiently."
> > > > > > > > > >>>>>>>>
> > > > > > > > > >>>>>>>> On Sun, May 16, 2021 at 11:35 AM Wes McKinney <[email protected]>
> > > > > > > > > >>>>>>>> wrote:
> > > > > > > > > >>>>>>>>
> > > > > > > > > >>>>>>>>> Here's what's there now:
> > > > > > > > > >>>>>>>>>
> > > > > > > > > >>>>>>>>> "Apache Arrow is a cross-language development platform for
> > > > > > > > > >>>>>>>>> in-memory data. It specifies a standardized
> > > > > > > > > >>>>>>>>> language-independent columnar memory format for flat and
> > > > > > > > > >>>>>>>>> hierarchical data, organized for efficient analytic
> > > > > > > > > >>>>>>>>> operations on modern hardware. It also provides
> > > > > > > > > >>>>>>>>> computational libraries and zero-copy streaming messaging
> > > > > > > > > >>>>>>>>> and interprocess communication…"
> > > > > > > > > >>>>>>>>>
> > > > > > > > > >>>>>>>>> How about something shorter like
> > > > > > > > > >>>>>>>>>
> > > > > > > > > >>>>>>>>> "Apache Arrow is a cross-language development platform for
> > > > > > > > > >>>>>>>>> in-memory data. It enables systems to process and transport
> > > > > > > > > >>>>>>>>> data faster."
> > > > > > > > > >>>>>>>>>
> > > > > > > > > >>>>>>>>> Suggestions / refinements from others welcome
> > > > > > > > > >>>>>>>>>
> > > > > > > > > >>>>>>>>> On Sat, May 15, 2021 at 9:12 PM Dominik Moritz <[email protected]>
> > > > > > > > > >>>>>>>>> wrote:
> > > > > > > > > >>>>>>>>>>
> > > > > > > > > >>>>>>>>>> Super minor issue but could someone make the description on
> > > > > > > > > >>>>>>>>>> GitHub shorter?
> > > > > > > > > >>>>>>>>>>
> > > > > > > > > >>>>>>>>>> GitHub puts the description into the title of the page and
> > > > > > > > > >>>>>>>>>> makes it hard to find it in URL autocomplete.
> > > > > > > > > >>>>>>>>>>
>
> --
> Adam Hooper
> +1-514-882-9694
> http://adamhooper.com
>
