> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-boun...@ffmpeg.org> On Behalf Of Lynne
> Sent: April 8, 2021 12:14
> To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH V7 4/6] lavu: add side data AV_FRAME_DATA_BOUNDING_BOXES
>
> Apr 8, 2021, 04:48 by yejun....@intel.com:
>
> >> > +
> >> > +#ifndef AVUTIL_BOUNDINGBOX_H
> >> > +#define AVUTIL_BOUNDINGBOX_H
> >> > +
> >> > +#include "rational.h"
> >> > +#include "avassert.h"
> >> > +#include "frame.h"
> >> > +
> >> > +typedef struct AVBoundingBox {
> >> > +    /**
> >> > +     * Distance in pixels from the top edge of the frame to top
> >> > +     * and bottom, and from the left edge of the frame to left and
> >> > +     * right, defining the bounding box.
> >> > +     */
> >> > +    int top;
> >> > +    int left;
> >> > +    int bottom;
> >> > +    int right;
> >>
> >> Why not x, y, w, h?
> >> It makes more sense; all coordinates go like this.
> >
> > We want to keep it consistent with 'struct AVRegionOfInterest',
> > and we also have a plan to add a filter which converts bounding boxes
> > to RegionOfInterest, which impacts the encoder.
>
> Not a good idea. Region of interest is only useful to indicate a
> single region, while this API describes multiple regions
> within the frame. The video_enc_params API follows
> x, y, w, h because it's the easiest to work with, so I'm sure
> it's well worth going to such coordinates.
RegionOfInterest is similar to bounding box: it is also used for multiple
regions in an array, see AV_FRAME_DATA_REGIONS_OF_INTEREST. Anyway, I'm OK
with changing to x, y, w, h.

Btw, do we need to change RegionOfInterest to x, y, w, h as well? Is such a
change allowed after the current major version bump?

> >> > +
> >> > +typedef struct AVBoundingBoxHeader {
> >> > +    /**
> >> > +     * Information about how the bounding box is generated.
> >> > +     * For example, the DNN model name.
> >> > +     */
> >> > +    char source[128];
> >> > +
> >> > +    /**
> >> > +     * The size of the frame when it is detected.
> >> > +     */
> >> > +    int frame_width;
> >> > +    int frame_height;
> >>
> >> Why? This side data is attached to AVFrames only, where we
> >> already have width and height.
> >
> > The detection result will be used by other filters, for example,
> > dnn_classify (see https://github.com/guoyejun/ffmpeg/tree/classify).
> >
> > The filter dnn_detect detects all the objects (cat, dog, person, ...) in a
> > frame, while dnn_classify classifies one detected object (for example,
> > person) for its attributes (for example, emotion, etc.).
> >
> > The filter dnn_classify has to check whether the frame size has changed
> > since detection, to handle the filter chain below:
> > dnn_detect -> scale -> dnn_classify
>
> This doesn't look good. Why does dnn_classify need to know
> the original frame size at all?

For example, the original size of the frame is 100x100, and dnn_detect
detects a face at (10, 10) -> (30, 40); this data is saved in
AVBoundingBox.top/left/right/bottom.

The frame is then scaled to 50x50, and dnn_classify is used to analyze the
emotion of the face. It needs to know the frame size (100x100) at detection
time; with just (10, 10), (30, 40) and 50x50, it cannot map the box onto
the scaled frame.
> >> > diff --git a/libavutil/frame.h b/libavutil/frame.h
> >> > index a5ed91b20a..41e22de02a 100644
> >> > --- a/libavutil/frame.h
> >> > +++ b/libavutil/frame.h
> >> > @@ -198,6 +198,13 @@ enum AVFrameSideDataType {
> >> >       * Must be present for every frame which should have film grain applied.
> >> >       */
> >> >      AV_FRAME_DATA_FILM_GRAIN_PARAMS,
> >> > +
> >> > +    /**
> >> > +     * Bounding boxes for object detection and classification. The data is an
> >> > +     * AVBoundingBoxHeader followed by an array of AVBoundingBox, and
> >> > +     * AVBoundingBoxHeader.nb_bboxes is the number of array elements.
> >> > +     */
> >> > +    AV_FRAME_DATA_BOUNDING_BOXES,
> >> > };
> >>
> >> Finally, why call it a Bounding Box? It's not descriptive at all.
> >> How about "Object Classification"? It makes much more sense, it's
> >> exactly what this is. So AVObjectClassification, AVObjectClassificationHeader,
> >> AV_FRAME_DATA_OBJECT_CLASSIFICATION and so on.
> >
> > In object detection papers, bounding box is usually the term used.
> > We'd better use the same term, imho, thanks.
>
> Not in this case, API users won't have any idea what this is or what
> it's for. This is user-facing code after all.
> Papers in fields can get away with overloading language, but we're
> trying to make a concise API. Object classification makes sense
> because this is exactly what this is.

The term "bounding box" dominates the domain; for example, even the HEVC
spec uses this term, see page 317 of
https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-H.265-201911-I!!PDF-E&type=items

Also copying some of it here for your convenience:

    ar_bounding_box_top[ ar_object_idx[ i ] ]     u(16)
    ar_bounding_box_left[ ar_object_idx[ i ] ]    u(16)
    ar_bounding_box_width[ ar_object_idx[ i ] ]   u(16)
    ar_bounding_box_height[ ar_object_idx[ i ] ]  u(16)

I would prefer to use bounding box.
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".