On 08.02.22 16:02, Saul Wold wrote:
> This patch will read the beginning of source files and try to find
> the SPDX-License-Identifier to populate the licenseInfoInFiles
> field for each source file. This does not populate licenseConcluded
> at this time, nor rolls it up to package level.
> 
> We read the file as binary since some source files seem to contain
> binary characters; the license is then converted to an ASCII string.
> 
> Signed-off-by: Saul Wold <saul.w...@windriver.com>
> ---
> v2: Clean up commit message
> v3: Really fix up regex based on Peter's feedback!
> 
>   meta/classes/create-spdx.bbclass | 22 ++++++++++++++++++++++
>   1 file changed, 22 insertions(+)
> 
> diff --git a/meta/classes/create-spdx.bbclass b/meta/classes/create-spdx.bbclass
> index 8b4203fdb5..64aada8593 100644
> --- a/meta/classes/create-spdx.bbclass
> +++ b/meta/classes/create-spdx.bbclass
> @@ -37,6 +37,23 @@ SPDX_SUPPLIER[doc] = "The SPDX PackageSupplier field for SPDX packages created f
>   
>   do_image_complete[depends] = "virtual/kernel:do_create_spdx"
>   
> +def extract_licenses(filename):
> +    import re
> +
> +    lic_regex = re.compile(b'^\W*SPDX-License-Identifier:\s*([ \w\d.()+-]+?)(?:\s+\W*)?$', re.MULTILINE)

Taking inspiration from reuse-tool
(https://github.com/fsfe/reuse-tool/blob/master/src/reuse/_comment.py)
and the comment styles it parses, I ran the updated regex against a
sample set, and the results look good.

Test sample set:
(* SPDX-License-Identifier: Foo-Bar *)
/* SPDX-License-Identifier: Foo-Bar */
<!-- SPDX-License-Identifier: Foo-Bar -->
<#-- SPDX-License-Identifier: Foo-Bar -->
<%-- SPDX-License-Identifier: Foo-Bar --%>
{# SPDX-License-Identifier: Foo-Bar #}
{/* SPDX-License-Identifier: Foo-Bar */}
{{!-- SPDX-License-Identifier: Foo-Bar --}}
@Comment{ SPDX-License-Identifier: Foo-Bar }

Only the last one (@Comment{ ... }, which is BibTeX syntax) is missed:
the word "Comment" precedes the tag, so the leading ^\W* cannot match.
I have no idea whether that matters to anyone.

Just wanted to highlight that the regex does not catch every possible
input line.
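
For reference, a check along these lines reproduces the result outside
of bitbake. It is only a sketch: the `samples` list and the `extract`
helper are names made up for this test, and the pattern is the one from
the patch, written as a raw bytes literal so the backslashes are not
treated as string escapes.

```python
import re

# Pattern from the v3 patch, as a raw bytes literal.
lic_regex = re.compile(
    rb'^\W*SPDX-License-Identifier:\s*([ \w\d.()+-]+?)(?:\s+\W*)?$',
    re.MULTILINE,
)

# Comment styles from the sample set above (hypothetical license "Foo-Bar").
samples = [
    b"(* SPDX-License-Identifier: Foo-Bar *)",
    b"/* SPDX-License-Identifier: Foo-Bar */",
    b"<!-- SPDX-License-Identifier: Foo-Bar -->",
    b"<#-- SPDX-License-Identifier: Foo-Bar -->",
    b"<%-- SPDX-License-Identifier: Foo-Bar --%>",
    b"{# SPDX-License-Identifier: Foo-Bar #}",
    b"{/* SPDX-License-Identifier: Foo-Bar */}",
    b"{{!-- SPDX-License-Identifier: Foo-Bar --}}",
    b"@Comment{ SPDX-License-Identifier: Foo-Bar }",
]


def extract(line):
    """Return the license strings found in one line, decoded as ASCII."""
    return [m.decode("ascii") for m in lic_regex.findall(line)]


# Collect every sample for which the regex does not yield exactly "Foo-Bar".
missed = [s for s in samples if extract(s) != ["Foo-Bar"]]

for s in missed:
    print("missed:", s.decode("ascii"))
```

The BibTeX line fails because `^\W*` only skips non-word characters, and
"Comment" sits between the start of the line and the tag.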


> +
> +    try:
> +        with open(filename, 'rb') as f:
> +            size = min(15000, os.stat(filename).st_size)
> +            txt = f.read(size)
> +            licenses = re.findall(lic_regex, txt)
> +            if licenses:
> +                ascii_licenses = [lic.decode('ascii') for lic in licenses]
> +                return ascii_licenses
> +    except Exception as e:
> +        bb.warn(f"Exception reading {filename}: {e}")
> +    return None
> +
>   def get_doc_namespace(d, doc):
>       import uuid
>       namespace_uuid = uuid.uuid5(uuid.NAMESPACE_DNS, d.getVar("SPDX_UUID_NAMESPACE"))
> @@ -232,6 +249,11 @@ def add_package_files(d, doc, spdx_pkg, topdir, get_spdxid, get_types, *, archiv
>                           checksumValue=bb.utils.sha256_file(filepath),
>                       ))
>   
> +                if "SOURCE" in spdx_file.fileTypes:
> +                    extracted_lics = extract_licenses(filepath)
> +                    if extracted_lics:
> +                        spdx_file.licenseInfoInFiles = extracted_lics
> +
>                   doc.files.append(spdx_file)
>                   doc.add_relationship(spdx_pkg, "CONTAINS", spdx_file)
>                   spdx_pkg.hasFiles.append(spdx_file.SPDXID)
> 