Re: [OE-core] [PATCH v3] create-spdx: Get SPDX-License-Identifier from source

Konrad Weihmann Tue, 08 Feb 2022 08:20:02 -0800


On 08.02.22 16:50, Saul Wold wrote:
> 
> 
> On 2/8/22 07:37, Konrad Weihmann wrote:
>> On 08.02.22 16:02, Saul Wold wrote:
>>> This patch will read the beginning of source files and try to find
>>> the SPDX-License-Identifier to populate the licenseInfoInFiles
>>> field for each source file. This does not populate licenseConcluded
>>> at this time, nor rolls it up to package level.
>>>
>>> We read the file as binary since some source files seem to contain
>>> binary characters; the license is then converted to ASCII strings.
>>>
>>> Signed-off-by: Saul Wold <saul.w...@windriver.com>
>>> ---
>>> v2: Clean up commit message
>>> v3: Really fix up regex based on Peter's feedback!
>>>
>>>    meta/classes/create-spdx.bbclass | 22 ++++++++++++++++++++++
>>>    1 file changed, 22 insertions(+)
>>>
>>> diff --git a/meta/classes/create-spdx.bbclass 
>>> b/meta/classes/create-spdx.bbclass
>>> index 8b4203fdb5..64aada8593 100644
>>> --- a/meta/classes/create-spdx.bbclass
>>> +++ b/meta/classes/create-spdx.bbclass
>>> @@ -37,6 +37,23 @@ SPDX_SUPPLIER[doc] = "The SPDX PackageSupplier 
>>> field for SPDX packages created f
>>>    do_image_complete[depends] = "virtual/kernel:do_create_spdx"
>>> +def extract_licenses(filename):
>>> +    import re
>>> +
>>> +    lic_regex = re.compile(b'^\W*SPDX-License-Identifier:\s*([ \w\d.()+-]+?)(?:\s+\W*)?$', re.MULTILINE)
>>
>> Taking inspiration from reuse-tool
>> (https://github.com/fsfe/reuse-tool/blob/master/src/reuse/_comment.py)
>> and the way they parse comment blocks the results with the updated regex
>> look good.
>>
> I was not aware of this parser.


It was advertised as **the** reference implementation of
SPDX-License-Identifier scanning a while back, so I would consider it a
good place to get "inspired".

> 
>> Test sample set:
>> (* SPDX-License-Identifier: Foo-Bar *)
>> (* SPDX-License-Identifier: Foo-Bar *)
>> /* SPDX-License-Identifier: Foo-Bar */
>> <!-- SPDX-License-Identifier: Foo-Bar -->
>> <#-- SPDX-License-Identifier: Foo-Bar -->
>> <%-- SPDX-License-Identifier: Foo-Bar --%>
>> {# SPDX-License-Identifier: Foo-Bar #}
>> {/* SPDX-License-Identifier: Foo-Bar */}
>> {{!-- SPDX-License-Identifier: Foo-Bar --}}
>> @Comment{ SPDX-License-Identifier: Foo-Bar } ---> Only this one is
>> missed (which is bibtex syntax) - no idea if that is of importance for
>> anyone.
> 
>> Just wanted to highlight that this is not catching every possible 
>> input line
>>
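For anyone who wants to reproduce the check, here is a minimal sketch of how the sample set above can be run against the regex from the patch. The names SAMPLES, find_missed and extract_all are made up for this sketch and are not part of the patch; the regex is copied from the diff, only written as a raw bytes literal.

```python
import re

# Regex copied from the patch under review, written as a raw bytes literal.
LIC_REGEX = re.compile(
    rb'^\W*SPDX-License-Identifier:\s*([ \w\d.()+-]+?)(?:\s+\W*)?$',
    re.MULTILINE,
)

# Sample lines from this mail (SAMPLES is a hypothetical name for this sketch).
SAMPLES = [
    b"(* SPDX-License-Identifier: Foo-Bar *)",
    b"/* SPDX-License-Identifier: Foo-Bar */",
    b"<!-- SPDX-License-Identifier: Foo-Bar -->",
    b"<#-- SPDX-License-Identifier: Foo-Bar -->",
    b"<%-- SPDX-License-Identifier: Foo-Bar --%>",
    b"{# SPDX-License-Identifier: Foo-Bar #}",
    b"{/* SPDX-License-Identifier: Foo-Bar */}",
    b"{{!-- SPDX-License-Identifier: Foo-Bar --}}",
    b"@Comment{ SPDX-License-Identifier: Foo-Bar }",
]


def find_missed(samples):
    """Return the sample lines from which no identifier is extracted."""
    return [line for line in samples if not LIC_REGEX.findall(line)]


def extract_all(samples):
    """Scan all samples as one multi-line buffer, as the patch does per file."""
    joined = b"\n".join(samples)
    return [m.decode("ascii") for m in LIC_REGEX.findall(joined)]


if __name__ == "__main__":
    for line in find_missed(SAMPLES):
        print("missed:", line.decode("ascii"))
```

The bibtex line is missed because the line starts with "@Comment{", and the leading \W* cannot skip over the word characters in "Comment".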
> Do we need to pull in the complexity of the reuse-tool comment parser? 
> Let me know, might not make 3.5 if this is the case.

For me this seems to be a good best-effort approach, and as long as no 
one claims that the output is technically 100% correct, I'm good with it.
If one wanted to improve it, I would say using an external 
tokenizer like pygments would be good: extract all the comments from 
the beginning of a file and then filter them by regex.
For now the implementation might miss a few edge cases (like the 
syntax I highlighted or comment blocks longer than 15k chars) but that 
is totally fine with me.
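
To illustrate the 15k edge case, here is a small sketch that mimics the read window of the patch on in-memory bytes instead of a file. The helper licenses_in_window and the constants READ_LIMIT, HEADER and FILLER are assumptions made for this example only; the regex and the 15000 cap are taken from the diff.

```python
import re

# Regex copied from the patch under review, written as a raw bytes literal.
LIC_REGEX = re.compile(
    rb'^\W*SPDX-License-Identifier:\s*([ \w\d.()+-]+?)(?:\s+\W*)?$',
    re.MULTILINE,
)

READ_LIMIT = 15000  # same cap as f.read(size) in the patch


def licenses_in_window(data, limit=READ_LIMIT):
    """Scan only the first `limit` bytes of `data`, like the patch does."""
    window = data[:limit]
    return [m.decode("ascii") for m in LIC_REGEX.findall(window)]


# Hypothetical file contents: one license line plus 20000 bytes of filler.
HEADER = b"// SPDX-License-Identifier: MIT\n"
FILLER = b"x" * 20000 + b"\n"

if __name__ == "__main__":
    # Identifier at the top of the file: inside the window, so it is found.
    print(licenses_in_window(HEADER + FILLER))
    # Identifier after 20000 bytes of other content: outside the window.
    print(licenses_in_window(FILLER + HEADER))
```

An identifier that only shows up after the first 15000 bytes is never seen by the regex, which is the trade-off for not reading whole files.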

> 
> Sau!
> 
>>
>>> +
>>> +    try:
>>> +        with open(filename, 'rb') as f:
>>> +            size = min(15000, os.stat(filename).st_size)
>>> +            txt = f.read(size)
>>> +            licenses = re.findall(lic_regex, txt)
>>> +            if licenses:
>>> +                ascii_licenses = [lic.decode('ascii') for lic in licenses]
>>> +                return ascii_licenses
>>> +    except Exception as e:
>>> +        bb.warn(f"Exception reading {filename}: {e}")
>>> +    return None
>>> +
>>>    def get_doc_namespace(d, doc):
>>>        import uuid
>>>        namespace_uuid = uuid.uuid5(uuid.NAMESPACE_DNS, 
>>> d.getVar("SPDX_UUID_NAMESPACE"))
>>> @@ -232,6 +249,11 @@ def add_package_files(d, doc, spdx_pkg, topdir, 
>>> get_spdxid, get_types, *, archiv
>>>                            checksumValue=bb.utils.sha256_file(filepath),
>>>                        ))
>>> +                if "SOURCE" in spdx_file.fileTypes:
>>> +                    extracted_lics = extract_licenses(filepath)
>>> +                    if extracted_lics:
>>> +                        spdx_file.licenseInfoInFiles = extracted_lics
>>> +
>>>                    doc.files.append(spdx_file)
>>>                    doc.add_relationship(spdx_pkg, "CONTAINS", spdx_file)
>>>                    spdx_pkg.hasFiles.append(spdx_file.SPDXID)
>>>
>>>
>>>
>>> 
>>>
>

