On 08.02.22 16:50, Saul Wold wrote:
>
>
> On 2/8/22 07:37, Konrad Weihmann wrote:
>> On 08.02.22 16:02, Saul Wold wrote:
>>> This patch will read the begining of source files and try to find
>>> the SPDX-License-Identifier to populate the licenseInfoInFiles
>>> field for each source file. This does not populate licenseConcluded
>>> at this time, nor rolls it up to package level.
>>>
>>> We read as binary file since some source code seem to have some
>>> binary characters, the license is then converted to ascii strings.
>>>
>>> Signed-off-by: Saul Wold <saul.w...@windriver.com>
>>> ---
>>> v2: Clean up commit message
>>> v3: Really fix up regex based on Peter's feedback!
>>>
>>>  meta/classes/create-spdx.bbclass | 22 ++++++++++++++++++++++
>>>  1 file changed, 22 insertions(+)
>>>
>>> diff --git a/meta/classes/create-spdx.bbclass
>>> b/meta/classes/create-spdx.bbclass
>>> index 8b4203fdb5..64aada8593 100644
>>> --- a/meta/classes/create-spdx.bbclass
>>> +++ b/meta/classes/create-spdx.bbclass
>>> @@ -37,6 +37,23 @@ SPDX_SUPPLIER[doc] = "The SPDX PackageSupplier
>>> field for SPDX packages created f
>>>  do_image_complete[depends] = "virtual/kernel:do_create_spdx"
>>> +def extract_licenses(filename):
>>> +    import re
>>> +
>>> +    lic_regex = re.compile(b'^\W*SPDX-License-Identifier:\s*([
>>> \w\d.()+-]+?)(?:\s+\W*)?$', re.MULTILINE)
>>
>> Taking inspiration from reuse-tool
>> (https://github.com/fsfe/reuse-tool/blob/master/src/reuse/_comment.py)
>> and the way they parse comment blocks the results with the updated regex
>> look good.
>>
> I was not aware of this parser.

It was advertised as **the** reference implementation of SPDX-Lic
scanning a while back - so I would consider it good to get "inspired"

>
>> Test sample set:
>> (* SPDX-License-Identifier: Foo-Bar *)
>> (* SPDX-License-Identifier: Foo-Bar *)
>> /* SPDX-License-Identifier: Foo-Bar */
>> <!-- SPDX-License-Identifier: Foo-Bar -->
>> <#-- SPDX-License-Identifier: Foo-Bar -->
>> <%-- SPDX-License-Identifier: Foo-Bar --%>
>> {# SPDX-License-Identifier: Foo-Bar #}
>> {/* SPDX-License-Identifier: Foo-Bar */}
>> {{!-- SPDX-License-Identifier: Foo-Bar --}}
>> @Comment{ SPDX-License-Identifier: Foo-Bar } ---> Only this one is
>> missed (which is bibtex syntax) - no idea if that is of importance for
>> anyone.
>
>> Just wanted to highlight that this is not catching every possible
>> input line
>>
> Do we need to pull in the complexity of the reuse-tool comment parser?
> Let me know, might not make 3.5 if this is the case.

For me this seems to be a good best-effort approach, and as long as no
one claims that the output is technically 100% correct, I'm good with it.

If one wanted to improve it, I would say using an external tokenizer
like pygments would be good - extracting all the comments from the
beginning of a file and then filtering by regex.

For now the implementation might miss a few edge cases (like the syntax
I highlighted, or comment blocks longer than 15k chars), but that is
totally fine with me.

>
> Sau!
>
>>
>>> +
>>> +    try:
>>> +        with open(filename, 'rb') as f:
>>> +            size = min(15000, os.stat(filename).st_size)
>>> +            txt = f.read(size)
>>> +            licenses = re.findall(lic_regex, txt)
>>> +            if licenses:
>>> +                ascii_licenses = [lic.decode('ascii') for lic in
>>> licenses]
>>> +                return ascii_licenses
>>> +    except Exception as e:
>>> +        bb.warn(f"Exception reading {filename}: {e}")
>>> +    return None
>>> +
>>>  def get_doc_namespace(d, doc):
>>>      import uuid
>>>      namespace_uuid = uuid.uuid5(uuid.NAMESPACE_DNS,
>>> d.getVar("SPDX_UUID_NAMESPACE"))
>>> @@ -232,6 +249,11 @@ def add_package_files(d, doc, spdx_pkg, topdir,
>>> get_spdxid, get_types, *, archiv
>>>                      checksumValue=bb.utils.sha256_file(filepath),
>>>                  ))
>>> +            if "SOURCE" in spdx_file.fileTypes:
>>> +                extracted_lics = extract_licenses(filepath)
>>> +                if extracted_lics:
>>> +                    spdx_file.licenseInfoInFiles = extracted_lics
>>> +
>>>              doc.files.append(spdx_file)
>>>              doc.add_relationship(spdx_pkg, "CONTAINS", spdx_file)
>>>              spdx_pkg.hasFiles.append(spdx_file.SPDXID)
>>>
>>>
>>>
>>>
>>>
>
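
For anyone who wants to reproduce the check, here is a minimal standalone sketch of how the sample set can be run against the regex from the patch. It is not part of the patch itself: the names scan, SAMPLES, READ_LIMIT and late_tag are made up for illustration, the pattern is written as a raw bytes literal (which should be equivalent to the one quoted above), and it works on in-memory bytes instead of files. The duplicate sample line is listed only once.

```python
import re

# Pattern taken from the quoted patch, written as a raw bytes literal.
LIC_REGEX = re.compile(
    rb'^\W*SPDX-License-Identifier:\s*([ \w\d.()+-]+?)(?:\s+\W*)?$',
    re.MULTILINE,
)

# The patch only inspects the first 15000 bytes of each file.
READ_LIMIT = 15000


def scan(data, limit=READ_LIMIT):
    """Return the identifiers found in the first `limit` bytes of `data`."""
    return [m.decode("ascii") for m in LIC_REGEX.findall(data[:limit])]


# Comment styles from the test sample set (hypothetical in-memory inputs).
SAMPLES = [
    b"(* SPDX-License-Identifier: Foo-Bar *)",
    b"/* SPDX-License-Identifier: Foo-Bar */",
    b"<!-- SPDX-License-Identifier: Foo-Bar -->",
    b"<#-- SPDX-License-Identifier: Foo-Bar -->",
    b"<%-- SPDX-License-Identifier: Foo-Bar --%>",
    b"{# SPDX-License-Identifier: Foo-Bar #}",
    b"{/* SPDX-License-Identifier: Foo-Bar */}",
    b"{{!-- SPDX-License-Identifier: Foo-Bar --}}",
    b"@Comment{ SPDX-License-Identifier: Foo-Bar }",
]

matched = [s for s in SAMPLES if scan(s) == ["Foo-Bar"]]
missed = [s for s in SAMPLES if not scan(s)]

# A tag that only starts after the read limit is never seen.
late_tag = (
    b"/*" + b"x" * READ_LIMIT + b"*/\n"
    b"/* SPDX-License-Identifier: Foo-Bar */\n"
)

if __name__ == "__main__":
    print("matched:", len(matched))
    print("missed:", missed)
    print("late tag, default limit:", scan(late_tag))
    print("late tag, whole input:", scan(late_tag, limit=len(late_tag)))
```

The bibtex line fails because the leading \W* cannot skip over the word "Comment" before the tag, and the late tag fails simply because it lies outside the bytes that are read. Both are the edge cases mentioned above.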