> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-boun...@ffmpeg.org> On Behalf Of Lynne
> Sent: April 9, 2021 23:16
> To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH V7 4/6] lavu: add side data AV_FRAME_DATA_BOUNDING_BOXES
>
> Apr 9, 2021, 16:35 by bygran...@gmail.com:
>
> > On Fri, Apr 9, 2021 at 01:13, Guo, Yejun <yejun....@intel.com> wrote:
> >
> >> > -----Original Message-----
> >> > From: ffmpeg-devel <ffmpeg-devel-boun...@ffmpeg.org> On Behalf Of Lynne
> >> > Sent: April 9, 2021 0:57
> >> > To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
> >> > Subject: Re: [FFmpeg-devel] [PATCH V7 4/6] lavu: add side data AV_FRAME_DATA_BOUNDING_BOXES
> >> >
> >> First of all, thanks for the quick replies. I see, all the discussions/comments are to
> >> make this patch better, thank you.
> >>
> >> > >> >> >> > +
> >> > >> >> >> > +typedef struct AVBoundingBoxHeader {
> >> > >> >> >> > +    /**
> >> > >> >> >> > +     * Information about how the bounding box is generated.
> >> > >> >> >> > +     * for example, the DNN model name.
> >> > >> >> >> > +     */
> >> > >> >> >> > +    char source[128];
> >> > >> >> >> > +
> >> > >> >> >> > +    /**
> >> > >> >> >> > +     * The size of frame when it is detected.
> >> > >> >> >> > +     */
> >> > >> >> >> > +    int frame_width;
> >> > >> >> >> > +    int frame_height;
> >> > >> >> >> >
> >> > >> >> >>
> >> > >> >> >> Why? This side data is attached to AVFrames only, where we
> >> > >> >> >> already have width and height.
> >> > >> >> >>
> >> > >> >> > The detection result will be used by other filters, for example,
> >> > >> >> > dnn_classify (see https://github.com/guoyejun/ffmpeg/tree/classify).
> >> > >> >> >
> >> > >> >> > The filter dnn_detect detects all the objects (cat, dog, person ...) in a
> >> > >> >> > frame, while dnn_classify classifies one detected object (for example, person)
> >> > >> >> > for its attribute (for example, emotion, etc.)
> >> > >> >> >
> >> > >> >> > The filter dnn_classify has to check if the frame size is changed after
> >> > >> >> > it is detected, to handle the below filter chain:
> >> > >> >> > dnn_detect -> scale -> dnn_classify
> >> > >> >> >
> >> > >> >> This doesn't look good. Why is dnn_classify needing to know
> >> > >> >> the original frame size at all?
> >> > >> >>
> >> > >> > For example, the original size of the frame is 100*100, and dnn_detect
> >> > >> > detects a face at place (10, 10) -> (30, 40), such data will be saved in
> >> > >> > AVBoundingBox.top/left/right/bottom.
> >> > >> >
> >> > >> > Then, the frame is scaled into 50*50.
> >> > >> >
> >> > >> > Then, dnn_classify is used to analyze the emotion of the face, it needs to
> >> > >> > know the frame size (100*100) when it is detected, otherwise, it does not
> >> > >> > work with just (10,10), (30,40) and 50*50.
> >> > >> >
> >> > >> Why can't the scale filter also rescale this side data as well?
> >> > >>
> >> > > I'm afraid that we could not make sure all such filters (including filters in the
> >> > > future) will do the rescale. And in the previous discussion, I got to know that
> >> > > 'many other existing side-data types are invalidated by scaling'.
> >> > >
> >> > > So, we need frame_width and frame_height here.
> >> > >
> >> > No, you don't.
> >> > You just need to make sure filters which change resolution
> >> > or do cropping also change the side data parameters.
> >> > It's called maintainership. As-is, this won't even work with cropping,
> >> > only with basic aspect ratio preserving scaling.
> >> > For the lack of a better term, this is a hack.
> >>
> >> As discussed in the previous email, for the frame size change case, dnn_classify
> >> (and other filters which use the detection result, for example drawbox) can
> >> just output a warning message to tell the user what happens, and not do the
> >> classification; otherwise, it will give a wrong/weird result which makes the
> >> user confused.
> >>
> >> > I would accept just specifying that if the frame dimensions are
> >> > altered in any way, the side-data is no longer valid and it's up
> >> > to users to figure that out by out of bound coordinates.
> >> > This is what we currently do with video_enc_params.
> >>
> >> frame_width/frame_height is not perfect (for cases such as: scale down
> >> + crop + scale up to the same size), but it provides more info than the checking
> >> of 'out of bound coordinates'. There are many other possible issues when the
> >> coordinates are within the frame.
> >>
> >> If we think we'd better not let the user get more info from the warning message,
> >> I'm ok to remove them.
> >>
> >> I'll remove them if there's another comment supporting the removal, and
> >> there's no objection.
> >>
> >> > >> >> >> > diff --git a/libavutil/frame.h b/libavutil/frame.h
> >> > >> >> >> > index a5ed91b20a..41e22de02a 100644
> >> > >> >> >> > --- a/libavutil/frame.h
> >> > >> >> >> > +++ b/libavutil/frame.h
> >> > >> >> >> > @@ -198,6 +198,13 @@ enum AVFrameSideDataType {
> >> > >> >> >> >      * Must be present for every frame which should have film grain applied.
> >> > >> >> >> >      */
> >> > >> >> >> >     AV_FRAME_DATA_FILM_GRAIN_PARAMS,
> >> > >> >> >> > +
> >> > >> >> >> > +    /**
> >> > >> >> >> > +     * Bounding boxes for object detection and classification, the data is an AVBoundingBoxHeader
> >> > >> >> >> > +     * followed by an array of AVBoundingBox, and AVBoundingBoxHeader.nb_bboxes is the number
> >> > >> >> >> > +     * of array elements.
> >> > >> >> >> > +     */
> >> > >> >> >> > +    AV_FRAME_DATA_BOUNDING_BOXES,
> >> > >> >> >> > };
> >> > >> >> >> >
> >> > >> >> >> Finally, why call it a Bounding Box? It's not descriptive at all.
> >> > >> >> >> How about "Object Classification"? It makes much more sense, it's
> >> > >> >> >> exactly what this is. So AVObjectClassification, AVObjectClassification,
> >> > >> >> >> AV_FRAME_DATA_OBJECT_CLASSIFICATION and so on.
> >> > >> >> >>
> >> > >> >> > In object detection papers, bounding box is usually used.
> >> > >> >> > We'd better use the same term, imho, thanks.
> >> > >> >> >
> >> > >> >> Not in this case, API users won't have any idea what this is or what
> >> > >> >> it's for. This is user-facing code after all.
> >> > >> >> Papers in fields can get away with overloading language, but we're
> >> > >> >> trying to make a concise API. Object classification makes sense
> >> > >> >> because this is exactly what this is.
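
Just to make the API-user side of this concrete (whichever name we end up with): reading the side data would look roughly like the sketch below. This is only an illustration of the header-plus-array layout described above, not code from the patch; dump_bboxes is a made-up name and the exact field types are assumed.

#include "libavutil/frame.h"
#include "libavutil/log.h"

/* Illustrative sketch only: assumes an AVBoundingBoxHeader sits at the start
 * of the side data buffer, immediately followed by
 * AVBoundingBoxHeader.nb_bboxes AVBoundingBox entries, as described above. */
static void dump_bboxes(const AVFrame *frame)
{
    const AVFrameSideData *sd =
        av_frame_get_side_data(frame, AV_FRAME_DATA_BOUNDING_BOXES);
    if (!sd)
        return;

    const AVBoundingBoxHeader *header = (const AVBoundingBoxHeader *)sd->data;
    const AVBoundingBox *bbox =
        (const AVBoundingBox *)(sd->data + sizeof(*header));

    for (int i = 0; i < header->nb_bboxes; i++) {
        /* top/left/right/bottom are in the coordinates of the frame the
         * detection ran on (frame_width x frame_height in the header). */
        av_log(NULL, AV_LOG_INFO, "bbox %d from %s: (%d, %d) -> (%d, %d)\n",
               i, header->source,
               bbox[i].left, bbox[i].top, bbox[i].right, bbox[i].bottom);
    }
}
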
> >> > >> >
> >> > >> > The term bounding box is dominating the domain, for example, even
> >> > >> > the HEVC spec uses this term, see page 317 of
> >> > >> > https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-H.265-201911-I!!PDF-E&type=items
> >> > >> >
> >> > >> > also copying some here for your convenience.
> >> > >> > ar_bounding_box_top[ ar_object_idx[ i ] ]     u(16)
> >> > >> > ar_bounding_box_left[ ar_object_idx[ i ] ]    u(16)
> >> > >> > ar_bounding_box_width[ ar_object_idx[ i ] ]   u(16)
> >> > >> > ar_bounding_box_height[ ar_object_idx[ i ] ]  u(16)
> >> > >> >
> >> > >> > I would prefer to use bounding box.
> >> > >> >
> >> > >> It's for an entirely different thing, and like I said, it's just overloaded
> >> > >> language because they can get away with it. We're trying to be generic.
> >> > >> This side data is for detecting _and_ classifying objects. In fact, most of
> >> > >> the structure is dedicated towards classifying. If you'd like users to actually
> >> > >> use this, give it a good name and don't leave them guessing what this
> >> > >> structure is by throwing vague jargon some other paper or spec has
> >> > >> because it's close enough.
> >> > >>
> >> > > All the people in the domain accept bounding box, they can understand this
> >> > > struct name easily and clearly, and they might be bothered if we use another name.
> >> > >
> >> > > btw, AVObjectClassification confuses people who just want object detection.
> >> > >
> >> > As I said, the name "bounding box" makes no sense once it gets overloaded
> >> > with object classification.
> >>
> >> dnn_detect creates an array of 'bounding boxes' for all detected objects, and
> >> dnn_classify assigns attributes for a set of bounding boxes (with the same object
> >> id). 'bounding box' serves both detection and classification properly.
> >>
> >> > Object classification is still the main use of the filters,
> >> > because the original proposal was to have all of this info be ffmpeg-private,
> >> > which would forbid simple object detection.
> >>
> >> The original proposal was to add it as side data, which is ffmpeg-public, and then
> >> we spent much time discussing/trying ffmpeg-private as a temporary
> >> method, and since it is not good to be temporary, we now switch back to
> >> ffmpeg-public.
> >>
> >> During the whole period, we never had any intention to
> >> 'forbid simple object detection'; I don't quite understand your point here.
> >>
> >> > So I still maintain this should be called "Object classification". I'd accept
> >> > "Object detection" as well, but definitely not "bounding box".
> >>
> >> imho, 'Object detection' and 'Object classification' are worse, they just
> >> describe one aspect of the struct. The users might just use the filter dnn_detect,
> >> or they might use the filters dnn_detect + dnn_classify.
> >>
> >> > Since the decision was made to make the side data public, we have to make
> >> > very sure it contains no hacks and is not impossible to extend, since we don't want
> >> > to have an "AV_SIDE_DATA_OBJECT_CLASSIFICATION_VERSION_2_SORRY_WE_SCREWED_UP"
> >> > faster than you can say "Recursive cascade correlation artificial neural
> >> > networks".
> >>
> >> sorry, I don't quite understand your point here.
> >>
> >> 'bounding box' is designed for general purpose, to contain the info for
> >> detection/classification.
> >> It doesn't matter which DNN model is used, it doesn't matter if a
> >> traditional algorithm (non-dnn) is used.
> >>
> >> I'm open to use a better name. And bounding box is the best one for me till now.
> >> Everyone in the domain knows the exact meaning of bounding box without
> >> extra explanation. This word has been extended/evolved with such meaning in
> >> this domain.
> >>
> > +1
> >
> > I think it is wise to use the name which is widely used in the field.
> >
> Way to go, ignoring half the exchange to voice an opinion after we
> came to an agreement.
Some people take days to think things through before giving comments; that's nice and welcome. I have even received comments on the last day before pushing.

I might be too busy preparing the next version of the patch (we have some video analytics feature code/plans depending on this one, and I hope to move things forward). I need to wait several more days to get more comments.

My point is that bounding box is a good name (maybe the best) if we have an object detection/classification background; it might not be the best name without such a background. Anyway, I'm open to using a better name.
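
Regarding the earlier question of whether filters that change the resolution could update this side data themselves: for reference, this is roughly what such an update would look like. It is only a sketch against the struct proposed in this patch; rescale_bounding_boxes is a hypothetical helper, not an existing API, the field types are assumed, and cropping would additionally need offsetting and clipping.

#include "libavutil/frame.h"

/* Hypothetical helper (not part of this patch): a filter that scales the
 * frame from src_w x src_h to dst_w x dst_h could call this on its output
 * frame so the boxes stay valid in the new coordinate system. */
static void rescale_bounding_boxes(AVFrame *frame, int src_w, int src_h,
                                   int dst_w, int dst_h)
{
    AVFrameSideData *sd =
        av_frame_get_side_data(frame, AV_FRAME_DATA_BOUNDING_BOXES);
    if (!sd)
        return;

    AVBoundingBoxHeader *header = (AVBoundingBoxHeader *)sd->data;
    AVBoundingBox *bbox = (AVBoundingBox *)(sd->data + sizeof(*header));

    for (int i = 0; i < header->nb_bboxes; i++) {
        bbox[i].left   = bbox[i].left   * dst_w / src_w;
        bbox[i].right  = bbox[i].right  * dst_w / src_w;
        bbox[i].top    = bbox[i].top    * dst_h / src_h;
        bbox[i].bottom = bbox[i].bottom * dst_h / src_h;
    }

    /* If the boxes are rescaled here, the detection-time size recorded in
     * the header can simply be updated as well; a consumer like dnn_classify
     * would then no longer see a mismatch. */
    header->frame_width  = dst_w;
    header->frame_height = dst_h;
}

With the 100*100 example above, the face box (10, 10) -> (30, 40) would become (5, 5) -> (15, 20) after scaling to 50*50.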