Re: [DISCUSS] FLIP-540: Support VECTOR_SEARCH in Flink SQL

Shengkai Fang Thu, 14 Aug 2025 22:13:09 -0700

Hi Timo, thank you for your detailed suggestions. Please see my responses
below.


1) ProcTime

+1 for aligning the behavior with PTF. I’ve updated the FLIP accordingly.

2) RowTime

I have some concerns regarding the `ROWTIME` handling. Let me illustrate
with an example.

Suppose the input table schema is:
`<query_col ARRAY<FLOAT>, ts TIMESTAMP(3) *ROWTIME*>`
and the vector table schema is:
`<id INT, search_col ARRAY<FLOAT>>`

Using the following SQL:
```sql
SELECT * FROM input_table, LATERAL TABLE(VECTOR_SEARCH(
   SEARCH_TABLE => TABLE vector_table,
   COLUMN_TO_SEARCH => DESCRIPTOR(search_col),
   COLUMN_TO_QUERY => input_table.query_col,
   ON_TIME => input_table.ts))
```

The output schema becomes:
`ROW<query_col ARRAY<FLOAT>, ts TIMESTAMP(3), id INT, search_col ARRAY<FLOAT>, score DOUBLE, ts0 TIMESTAMP(3)>`

This results in two timestamp fields: ts (from input) and ts0 (generated by
the operator).
Having both may be confusing. Is this the intended behavior?
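
To make the concern concrete, here is a minimal Python sketch (not Flink code) of how I picture the lateral call: a loop over input rows where each invocation returns the best matches plus a score. The cosine metric, the `top_k` parameter, the helper names, and the idea that `ts0` simply carries the same value as the input `ts` are all assumptions for illustration; none of them is settled by the FLIP.

```python
import math
from typing import Any, Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_search_rows(
    input_rows: List[Dict[str, Any]],
    vector_rows: List[Dict[str, Any]],
    top_k: int = 1,  # hypothetical parameter, not taken from the FLIP
) -> List[Dict[str, Any]]:
    """Model the lateral call as a loop: one search per input row."""
    output = []
    for row in input_rows:
        # Score every candidate in the vector table against this query vector.
        scored = [
            (cosine_similarity(row["query_col"], cand["search_col"]), cand)
            for cand in vector_rows
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        for score, match in scored[:top_k]:
            output.append(
                {
                    # Columns passed through from the input table.
                    "query_col": row["query_col"],
                    "ts": row["ts"],
                    # Columns produced by the search.
                    "id": match["id"],
                    "search_col": match["search_col"],
                    "score": score,
                    # Assumption: the appended time column repeats the input ts.
                    "ts0": row["ts"],
                }
            )
    return output


if __name__ == "__main__":
    inputs = [{"query_col": [1.0, 0.0], "ts": "2025-08-14 22:13:09.000"}]
    vectors = [
        {"id": 1, "search_col": [0.0, 1.0]},
        {"id": 2, "search_col": [2.0, 0.0]},
    ]
    for out_row in vector_search_rows(inputs, vectors):
        print(out_row)
```

Under these assumptions every output row carries the same timestamp twice, once as `ts` and once as `ts0`, which is the duplication I am asking about.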

3) Naming

I did consider SEARCH_VECTOR, but many vendors use VECTOR_SEARCH — for
example, Spark[1] and BigQuery[2].
To maintain consistency and reduce the learning curve, I suggest aligning
with existing industry practice.


[1]
https://docs.databricks.com/aws/en/sql/language-manual/functions/vector_search
[2] https://cloud.google.com/bigquery/docs/vector-search-intro

Best,
Shengkai

Timo Walther <twal...@apache.org> wrote on Thu, Aug 14, 2025 at 21:49:

> Hi Shengkai,
>
> thank you for proposing this FLIP. Also, thank you for considering my
> thoughts from FLIP-517, even though I haven't managed to finalize the
> discussion/voting yet.
>
> It looks mostly good to me. However, I would like to discuss the
> semantics of the `on_time` parameter:
>
> 1) Proctime
>
> I truly believe we should avoid the need for a `proctime` attribute.
> Teaching the rowtime attributes to users is already painful enough, but
> additionally teaching proctime is worse. For PTFs of FLIP-440, only
> rowtime attributes can be used in f(on_time => ...) and we should do the
> same for future built-in PTFs. Not specifying `on_time` can be equal to
> proctime.
>
> So users can just naturally use the PTF, with the mental model of
> LATERAL being a foreach loop where each invocation happens instantly (in
> processing time).
>
> 2) Rowtime
>
> All PTFs should follow the SystemTypeInference:
>
>
> https://github.com/apache/flink/blob/master/flink-table/flink-table-common/src/main/java/org/apache/flink/table/types/inference/SystemTypeInference.java#L239
>
> It assumes that when an `on_time` parameter is passed, the result
> appends a `rowtime` column that can be used in subsequent time based
> operations. Can we add such a column in the output for VECTOR_SEARCH as
> well?
>
> 3) Naming
>
> Just a general note, feel free to ignore: A function or operation should
> use a verb, not a noun. E.g. JOIN, SEARCH, SELECT. Vector search is a
> concept. The function should rather be called `SEARCH_VECTOR`. This was
> also explained in FLIP-517.
>
> Thanks,
> Timo
>
>
> On 14.08.25 03:31, Shengkai Fang wrote:
> > Hi, all.
> >
> > There has been no feedback for a while. I plan to close this FLIP tomorrow
> > unless there are further comments. Thank you all for the discussion.
> >
> > Best,
> > Shengkai
> >
> > Yash Anand <yashanand.0...@gmail.com> wrote on Thu, Jul 31, 2025 at 15:47:
> >
> >> Hi Shengkai,
> >>
> >> Thanks for the FLIP, this will be a great addition to flink AI
> >> capabilities. +1 for this feature.
> >>
> >> Best,
> >> Yash Anand
> >>
> >> On Tue, Jul 29, 2025 at 7:23 PM Jacky Lau <liuyong...@gmail.com> wrote:
> >>
> >>> Hi Shengkai,
> >>>
> >>> Thanks for the FLIP and enhancement for AI capabilities in Flink. +1 for
> >>> this feature.
> >>>
> >>> Best,
> >>> Jacky Lau
> >>>
> >>> Hao Li <h...@confluent.io.invalid> wrote on Wed, Jul 30, 2025 at 01:03:
> >>>
> >>>> Hi Shengkai,
> >>>>
> >>>> Thanks for the FLIP and enhancement for AI capabilities in Flink. +1.
> >>>>
> >>>> Thanks,
> >>>> Hao
> >>>>
> >>>> On Tue, Jul 29, 2025 at 2:16 AM Shengkai Fang <fskm...@gmail.com> wrote:
> >>>>
> >>>>> Hi,
> >>>>> I'd like to start a discussion of FLIP-540: Support VECTOR_SEARCH in
> >>>>> Flink SQL[1].
> >>>>>
> >>>>> In FLIP-437/FLIP-525, Apache Flink has initially integrated Large
> >>>>> Language Model (LLM) capabilities, enabling semantic understanding and
> >>>>> real-time processing of streaming data pipelines. This integration has
> >>>>> been technically validated in scenarios such as log classification and
> >>>>> real-time question-answering systems. However, the current architecture
> >>>>> allows Flink to only use embedding models to convert unstructured data
> >>>>> (e.g., text, images) into high-dimensional vector features, which are
> >>>>> then persisted to downstream storage systems (e.g., Milvus, MongoDB). It
> >>>>> lacks real-time online querying and similarity analysis capabilities for
> >>>>> vector spaces. To address this limitation, we propose introducing the
> >>>>> VECTOR_SEARCH function in this FLIP, enabling users to perform streaming
> >>>>> vector similarity searches and real-time context retrieval (e.g.,
> >>>>> Retrieval-Augmented Generation, RAG) directly within Flink.
> >>>>>
> >>>>> Looking forward to comments and suggestions for improvements!
> >>>>>
> >>>>> Best,
> >>>>> Shengkai
> >>>>>
> >>>>> [1]
> >>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-540%3A+Support+VECTOR_SEARCH+in+Flink+SQL
