GSOC 2022

Philippe Ombredanne edited this page Feb 24, 2022 · 28 revisions

This page contains information for aspiring participants interested in participating in and helping with the GSoC 2022 program.

AboutCode is a family of FOSS projects to uncover data ... about software code:

  • where does the code come from? which software package?
  • what is its license? copyright?
  • is the code vulnerable, maintained, well coded?

All these are questions that are important to answer: there are millions of free and open source software components available on the web for reuse.

Knowing where a software package comes from, what its license is and whether it is vulnerable should be a problem of the past such that everyone can safely consume more free and open source software.

Join us to make it so!

Our tools are used to help detect and report the origin and license of source code, packages and binaries as well as discover software and package dependencies, and in the future track security vulnerabilities, bugs and other important software package attributes. This is a suite of command line tools, web-based and API servers and desktop applications.

AboutCode projects are...

  • ScanCode.io is a web-based application and API to run and review scans in rich scripted ScanPipe pipelines.

  • ScanCode Toolkit is a popular command line tool to scan code for licenses, copyrights and packages, used by many organizations and FOSS projects, small and large.

  • VulnerableCode is a web-based API and database to collect and track all the known software package vulnerabilities.

  • ScanCode Workbench is a JavaScript, Electron-based desktop application to review scan results and document your origin and license conclusions.

  • AboutCode Toolkit is a command line tool to document and inventory known packages and licenses and generate attribution docs, typically using the results of analyzed and reviewed scans.

  • TraceCode Toolkit is a command line tool to find which source code file is used to create a compiled binary by tracing and graphing a build.

  • DeltaCode is a command line tool to compare scans and determine if and where there are material differences that affect licensing.

  • FetchCode is a library to reliably fetch any code via HTTP, FTP and version control systems such as git.

  • container-inspector is a command line tool to analyze the code in Docker and container images, and a low-level library to handle this.

  • license-expression is a library to parse, analyze, simplify and render boolean license expressions (such as SPDX license expressions).

We have also co-founded and contributed to important projects for other organizations:

  • Package URL which is an emerging standard to reference software packages of all types with simple, readable and concise URLs.

  • SPDX aka. Software Package Data Exchange, a spec to document the origin and licensing of packages.

  • ClearlyDefined to review and help FOSS projects improve their licensing and documentation clarity.

Contact

Join the chat online at https://gitter.im/aboutcode-org/discuss (or by IRC or Matrix). Introduce yourself and start the discussion!

For personal issues, you can contact the primary org admin directly: @pombredanne and [email protected]

Please try asking questions the smart way: http://www.catb.org/~esr/faqs/smart-questions.html

Technology

Discovering the origin, license and security of code is a vast topic. We primarily use Python, with some C/C++, Rust and Go for performance-sensitive code. We use Django and PostgreSQL for web apps and API servers. We use Electron and JavaScript for our ScanCode Workbench.

Our domain includes text analysis and processing (for instance for copyrights and licenses detection), parsing (for package manifest formats), binary analysis (to detect the origin and license of binaries, primarily based on the corresponding source code), Web-based tools and APIs (to expose the tools and libraries as Web Services) and low-level data structures for efficient matching (such as high performance string search automatons).

Skills

Incoming students will need the following skills:

  • Intermediate to strong Python programming. For some projects, strong C/C++, Go or Rust may be needed.
  • Familiarity with git as a version control system. Take the time to learn git!
  • Ability to set up your own development environment
  • An interest in open source security, licensing and generally software composition analysis.

We are happy to help you get up to speed, but the more you are able to demonstrate ability and skills in advance, the more likely we are to choose your application!

About your project application

We expect your application to be in the range of 1000 words. Anything less than that will probably not contain enough information for us to determine whether you are the right person for the job. Your proposal should contain at least the following information, plus anything you think is relevant:

  • Your name

  • Title of your proposal

  • Abstract of your proposal

  • Detailed description of your idea, including an explanation of why it is innovative and what it will contribute to the project

  • hint: explain your data structures and your planned main processing flows in detail.

  • Description of previous work, existing solutions (links to prototypes, bibliography are more than welcome)

  • Mention the details of your academic studies, any previous work, internships

  • Relevant skills that will help you to achieve the goal (programming languages, frameworks)

  • Any previous open-source projects (or even previous GSoC) you have contributed to and links.

  • Do you plan to have any other commitments during GSoC that may affect your work? Any vacations/holidays? Will you be available full time to work on your project? (Hint: do not bother applying if this is not a serious main time commitment during the GSoC time frame)

Join the chat online or by IRC at https://gitter.im/aboutcode-org/discuss, introduce yourself and start the discussion!

The best way to demonstrate your capability would be to submit a small patch ahead of the project selection for an existing issue or a new issue.
We will always consider and prefer project submissions where you have submitted a patch over any other submission without a patch.

You can pick any project idea from the list below. If you have other ideas that are not in this list, contact the team first to make sure it makes sense.

Our Project ideas

Here is a list of candidate project ideas for your consideration. Your own ideas are welcomed too! Please chat about them to get early feedback!

NOTE: these ideas are not sorted in any particular order of importance or priority... we are working to improve this.


ScanCode.io: web-based automated Conclusions app and GUI review app

This project is to create a new web application in ScanCode.io to help reach conclusions on an analysis project with respect to the origin, license or vulnerabilities of a codebase. This is an important project that comprises:

  • design the data models for conclusions
  • create a mini framework to run "bots" that can automate reaching "conclusions" on licensing and origin including spotting issues and exceptions
  • create the UI to visualize these conclusions and eventually update them by hand

ScanCode.io: external storage and archival of scanned code.

This project should extend ScanCode.io such that it can use external storage for the scanned code. The problem is that when you run a large number of projects the volume of storage that is used in ScanCode.io grows a lot. For this we can now archive projects, but we cannot archive the corresponding code that was scanned. The goal of this project is to add a new option in ScanCode.io to also archive to some blob storage the code that was scanned such that:

  • this can be done at the same time a project is archived

  • it is possible to restore from this archive a state that is essentially the same as the original project state in terms of files and data
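
As a rough illustration of the archive and restore round trip described above, here is a minimal sketch using only the Python standard library. It assumes the blob storage can be modeled as a local file path; the function names `archive_codebase`, `restore_codebase` and `snapshot` are hypothetical and are not part of ScanCode.io.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def archive_codebase(codebase_dir, archive_path):
    """Pack codebase_dir into a gzipped tarball and return its SHA256 hex digest."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname="." stores member paths relative to the codebase root.
        tar.add(str(codebase_dir), arcname=".")
    return hashlib.sha256(Path(archive_path).read_bytes()).hexdigest()


def restore_codebase(archive_path, target_dir, expected_sha256):
    """Verify the archive checksum, then unpack it into target_dir."""
    actual = hashlib.sha256(Path(archive_path).read_bytes()).hexdigest()
    if actual != expected_sha256:
        raise ValueError("archive checksum mismatch")
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(target_dir)


def snapshot(directory):
    """Map each file's relative path to its content, to compare two trees."""
    directory = Path(directory)
    return {
        path.relative_to(directory).as_posix(): path.read_bytes()
        for path in sorted(directory.rglob("*"))
        if path.is_file()
    }
```

Storing the digest next to the project record is one possible way to check, at restore time, that the restored files are the ones that were scanned. A real implementation would also need to handle large archives in a streaming fashion and untrusted archive members.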


ScanCode.io: pluggable advanced and extended pipelines with custom data models and UI.

This project should create a new framework for advanced ScanCode.io pipelines such that it becomes possible to:

  • include pluggable new data models specific to a pipeline (for instance to store the debug symbols found in a binary file)
  • add pluggable UI for a pipeline that would include ways to navigate the data models
  • add pluggable reporting for a pipeline that would include standard reports

As a practical implementation, this project should implement a concrete UI and extension to store and display extended information for Docker image and VM image projects, such as the OS, FS and layer details (displayed today as simple plain text).


ScanCode.io: create a system and web UI to scan ALL the packages from Debian and fix and review all of them

This project would become a prototype to help scan and curate the package licensing of a specific ecosystem. It would include:

  • specific pipelines tuned to collect lists of all the packages and organize the scans of these correctly
  • specific UI to visualize the queue of scan projects
  • specific libraries to detect common licensing issues of this package type
  • a UI to organize the community/peer review of all these package scans and issues
  • extension to create reports and update the package type manifests (here Debian machine readable copyright files)

See also: Create web application for massive scanning campaign of a whole package ecosystem


ScanCode.io: Add web service for software package and project evaluations and comparisons (djangopackages-like)

This project would build on the djangopackages/opencomparison code to provide:

  • a general purpose and easy way to create and share package comparison grids
  • their scanning integration in ScanCode.io

ScanCode.io: Add SBOM outputs for SPDX and CycloneDX

This project would add SPDX and CycloneDX reporting options to a ScanCode.io project.
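
To make the target output more concrete, here is a minimal sketch that builds a CycloneDX-style JSON document from a list of package dicts. The input keys (`name`, `version`, `purl`, `license`) and the function name `to_cyclonedx` are assumptions for illustration only and do not reflect the ScanCode.io data model. The output uses a small subset of CycloneDX 1.4 fields and has not been validated against the official schema.

```python
import json


def to_cyclonedx(packages):
    """Build a minimal CycloneDX-style document from a list of package dicts."""
    components = []
    for package in packages:
        component = {
            "type": "library",
            "name": package["name"],
            "version": package["version"],
        }
        if package.get("purl"):
            component["purl"] = package["purl"]
        if package.get("license"):
            # A license expression is one of the forms allowed for "licenses".
            component["licenses"] = [{"expression": package["license"]}]
        components.append(component)
    return {
        "bomFormat": "CycloneDX",
        "specVersion": "1.4",
        "version": 1,
        "components": components,
    }


if __name__ == "__main__":
    sample = [
        {
            "name": "packageurl-python",
            "version": "0.9.9",
            "purl": "pkg:pypi/packageurl-python@0.9.9",
            "license": "MIT",
        }
    ]
    print(json.dumps(to_cyclonedx(sample), indent=2))
```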

ScanCode.io: Improve the web UI experience in SCIO

We have limited ways to navigate the data in ScanCode.io. The goal of this project is to improve the UI in several areas and in particular:

  • enabling better linking to resource details from the graphics view
  • provide streamlined simpler resource views that only display the important data and have fewer details (but still provide ways to drill down)
  • improve the way match details are visualized in the single resource page so that it is more obvious which license and which copyright were detected where, and what the actual license scoring is

In particular, this project could add a new pipeline for integration with external matching services. This would include tools such as SoftwareHeritage or Scanoss and other component or package identification integrations. The goal would be to create "pipes" and an improved package scanning pipeline that would include matching.


Create a web app and JSON Rest API to detect any text for license, and submit bugs (SCTK-aaS).

This project is to create a web-based mini application (possibly plugged into ScanCode.io as an app) that would receive a text as input and run ScanCode Toolkit license and copyright detection on this text. It would display the results such that the matched texts and all matching details are easy to understand, and make it visually obvious what was matched and how it was matched. It should also make it easy to provide feedback when the detection is not correct, to suggest a better detection and possibly to create a new license detection rule. It should also integrate and run the "scancode-analyzer" to find possible issues automatically. Finally, it should make it easy to report a license detection issue from within the app, based on the results.

See also: Prototype a license detection view for scancode.io



ScanCode.io/VulnerableCode: CI integrations:

Create CI integrations which would scan the codebase for packages using SBOM tools like scancode-toolkit, then verify whether each of the packages is safe. Implement a GitHub Action and Jenkins plugins that do this.
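
The core of such a CI check can be sketched as a small function that takes the detected package URLs and a vulnerability lookup, and returns a non-zero exit code when something is vulnerable. This is only a sketch: `find_vulnerable`, `ci_exit_code`, `fake_lookup` and the in-memory `FAKE_DATABASE` are hypothetical stand-ins, and a real integration would query VulnerableCode instead.

```python
def find_vulnerable(purls, lookup):
    """Return a mapping of purl -> sorted vulnerability IDs for affected packages."""
    vulnerable = {}
    for purl in purls:
        vulnerabilities = lookup(purl)
        if vulnerabilities:
            vulnerable[purl] = sorted(vulnerabilities)
    return vulnerable


def ci_exit_code(purls, lookup):
    """Print a short report and return 1 if any package is vulnerable, else 0."""
    vulnerable = find_vulnerable(purls, lookup)
    for purl, vulnerabilities in vulnerable.items():
        print(f"VULNERABLE {purl}: {', '.join(vulnerabilities)}")
    return 1 if vulnerable else 0


# Hypothetical stand-in for a real vulnerability database query.
FAKE_DATABASE = {"pkg:pypi/example-lib@1.0.0": {"VULN-0002", "VULN-0001"}}


def fake_lookup(purl):
    """Return the set of known vulnerability IDs for a purl (may be empty)."""
    return FAKE_DATABASE.get(purl, set())
```

A CI job would typically call `sys.exit(ci_exit_code(purls, lookup))` so that the build fails when a vulnerable package is found.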


TraceCode/ ScanCode Toolkit/ScanCode.io: Source to binary reverse engineering

This project is about integrating multiple existing plugins and tools into a single tool to find which source code was used to create a compiled binary, using symbols, debug symbols, strings or more.
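
As an illustration of the string-based part of this idea, here is a minimal sketch that extracts printable ASCII strings from binary content and ranks candidate source files by how many of those strings they contain. This is a toy heuristic under stated assumptions and not how TraceCode or ScanCode work; the names `extract_strings` and `rank_sources` are hypothetical.

```python
import re


def extract_strings(binary_bytes, min_length=6):
    """Return the set of printable ASCII runs of at least min_length bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return {match.group().decode("ascii") for match in pattern.finditer(binary_bytes)}


def rank_sources(binary_bytes, sources):
    """
    Rank source files by the number of binary strings they contain.
    `sources` maps a source path to its text content.
    Return a list of (path, hit count) sorted by descending count, then path.
    """
    strings = extract_strings(binary_bytes)
    scores = []
    for path, text in sources.items():
        hits = sum(1 for string in strings if string in text)
        if hits:
            scores.append((path, hits))
    return sorted(scores, key=lambda item: (-item[1], item[0]))
```

Symbols and debug symbols carry much stronger evidence than plain strings, so a real tool would likely combine several such signals.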


ScanCode Workbench: Improve the ScanCode Workbench

There are several tasks to perform there and in particular:

  • upgrade to the latest version of all the packages
  • possibly switch to using TypeScript rather than plain JS
  • remove the conclusions module
  • Integrate with the ScanCode.io API
  • Consider alternative UIs such as OpossumUI as a possible merger path
  • See also Improve Workbench Table View

ScanCode Toolkit: License Language Server Protocol server for IDE integration

This project would implement a Language Server Protocol server for license and copyright that would use ScanCode Toolkit and provide live license and copyright feedback directly in IDEs. It would also provide a plugin for integration in at least one IDE such as Atom, VSCode or Eclipse.
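
For orientation, here is a minimal sketch of how license detections could be sent to an editor as Language Server Protocol diagnostics. The `Content-Length` framing and the `textDocument/publishDiagnostics` notification come from the LSP specification. The detection dict keys (`license_expression`, `start_line`, `end_line`, with 1-based lines) and the function names are assumptions for illustration, not the ScanCode Toolkit output format.

```python
import json


def encode_message(payload):
    """Frame a JSON-RPC payload with the Content-Length header used by LSP."""
    body = json.dumps(payload).encode("utf-8")
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body


def license_diagnostics(uri, detections):
    """Build a publishDiagnostics notification from license detections."""
    diagnostics = []
    for detection in detections:
        diagnostics.append({
            "range": {
                # LSP positions are zero-based; the input lines are one-based.
                "start": {"line": detection["start_line"] - 1, "character": 0},
                # The start of the line after end_line covers end_line fully.
                "end": {"line": detection["end_line"], "character": 0},
            },
            "severity": 3,  # 3 is "Information" in the LSP specification
            "source": "license-detection-sketch",
            "message": f"Detected license: {detection['license_expression']}",
        })
    return {
        "jsonrpc": "2.0",
        "method": "textDocument/publishDiagnostics",
        "params": {"uri": uri, "diagnostics": diagnostics},
    }
```

A complete server would also need to handle the `initialize` handshake and document synchronization requests, which this sketch leaves out.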


ScanCode Toolkit: Detect false positive licenses and other license detection anomalies.

This project would design and implement new ways to detect and filter possible ScanCode false positive license detections, possibly using AI and machine learning. A first attempt exists with scancode-analyzer and would be furthered. Heuristics or rule-based approaches could also be created.


This is to have faster license and copyright detection using less memory.





ScanCode Toolkit/ScanCode.io: Create GitHub SBOM creation action(s)



ScanCode Toolkit: Ingest CycloneDX and SPDX in ScanCode Toolkit

This project would add support to collect and parse the data from CycloneDX and SPDX SBOMs in a ScanCode Toolkit scan. This would likely mean treating these as "package data".
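
As a sketch of what treating an SBOM as "package data" could look like, the snippet below reads the `packages` section of an SPDX JSON document into plain dicts. The SPDX field names (`versionInfo`, `licenseDeclared`, `downloadLocation`, `externalRefs`) come from the SPDX JSON format. The output keys and the function name `packages_from_spdx` are hypothetical and do not reflect the ScanCode Toolkit package data model.

```python
import json


def _clean(value):
    """Map empty and SPDX placeholder values to None."""
    return None if value in (None, "", "NOASSERTION", "NONE") else value


def packages_from_spdx(spdx_text):
    """Return a list of simple package dicts from an SPDX JSON document."""
    document = json.loads(spdx_text)
    packages = []
    for entry in document.get("packages", []):
        purl = None
        for reference in entry.get("externalRefs", []):
            # SPDX stores a Package URL as an external reference of type "purl".
            if reference.get("referenceType") == "purl":
                purl = reference.get("referenceLocator")
                break
        packages.append({
            "name": _clean(entry.get("name")),
            "version": _clean(entry.get("versionInfo")),
            "declared_license": _clean(entry.get("licenseDeclared")),
            "download_url": _clean(entry.get("downloadLocation")),
            "purl": purl,
        })
    return packages
```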


VulnerableCode: Create a purl "virtual" database, library and service.

A key attraction of VulnerableCode is its built-in support for purl. The goal of this project is to make purl more accessible and visible and:

  • enhance the purl2url and url2purl support of the packageurl Python library such that it can process more common package types
  • enhance the packageurl Python library to convert more purl-like data to purl and in particular the OSV format, the new NVD 5.0 reference, the ORT coordinates, etc.
  • enhance the purl2cpe VulnerableCode utility such that it can process more cases to create better purls. Create a script to publish a continuously updated repository with the purl2cpe data.
  • expose a url2purl API service in VulnerableCode to help create correct purls
  • expose a purl2url API service in VulnerableCode to help return a list of URLs given a purl.
  • publish

VulnerableCode: Decentralized vulnerability data upload, sharing and synchronization


VulnerableCode: Cross-validation of VulnerableCode coverage of vulnerabilities against other tools: the vulntotal project!

There are several "free" vulnerability check tools and services such as OWASP Dependency-Check and Dependency-Track, deps.dev, osv.dev, Sonatype OSS Index, GitLab Vulnerability DB, GitHub Dependabot, and more (Snyk, BlackDuck, WS, etc.). The goal of this project would be to cross-validate against these services and DBs, VirusTotal style, to report if a queried package/version is found as vulnerable.
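
The cross-validation itself can be sketched as a small aggregation step: ask every data source about the same purl, then report which sources know about each vulnerability. This is only a sketch, assuming each data source can be wrapped as a callable that takes a purl and returns vulnerability IDs. The function names and the sources used in any example are hypothetical; no real service API is called here.

```python
def cross_validate(purl, sources):
    """
    Query every source for a purl and return a mapping of
    vulnerability ID -> sorted list of source names reporting it.
    `sources` maps a source name to a callable(purl) returning IDs.
    """
    reported = {name: set(lookup(purl)) for name, lookup in sources.items()}
    all_ids = set().union(*reported.values())
    return {
        vulnerability_id: sorted(
            name for name, ids in reported.items() if vulnerability_id in ids
        )
        for vulnerability_id in sorted(all_ids)
    }


def missing_from(source_name, purl, sources):
    """Return the IDs that other sources report but source_name does not."""
    results = cross_validate(purl, sources)
    return [
        vulnerability_id
        for vulnerability_id, names in results.items()
        if source_name not in names
    ]
```

In practice different databases use different identifiers for the same vulnerability, so a real implementation would need to reconcile aliases before comparing.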


VulnerableCode: Chrome and Firefox extension to support browsing Package URL

Browsing pkg:pypi/packageurl-python/ should go to https://pypi.org/project/packageurl-python/

Also create/register a DuckDuckGo bang mapper for https://github.com/package-url/packageurl-python/blob/main/src/packageurl/contrib/purl2url.py and add multiple URLs in purl2url.py.
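
The mapping such an extension needs can be sketched with the standard library alone. The snippet below is a simplified stand-in and is not the packageurl-python implementation: `parse_purl`, `purl_to_url` and the `URL_TEMPLATES` table are hypothetical, qualifiers and subpaths are ignored, and only two package types are covered for illustration.

```python
from urllib.parse import unquote

# Assumed URL templates for illustration; a real mapper covers many more types.
URL_TEMPLATES = {
    "pypi": "https://pypi.org/project/{name}/",
    "gem": "https://rubygems.org/gems/{name}",
}


def parse_purl(purl):
    """Split a purl into type, namespace, name and version (simplified)."""
    scheme, separator, remainder = purl.partition(":")
    if scheme != "pkg" or not separator:
        raise ValueError(f"not a purl: {purl!r}")
    # Drop the optional subpath and qualifiers, which this sketch ignores.
    remainder = remainder.split("#", 1)[0].split("?", 1)[0].strip("/")
    version = None
    if "@" in remainder:
        remainder, version = remainder.rsplit("@", 1)
    package_type, _, path = remainder.partition("/")
    segments = [unquote(segment) for segment in path.split("/") if segment]
    if not package_type or not segments:
        raise ValueError(f"purl is missing a type or name: {purl!r}")
    return {
        "type": package_type.lower(),
        "namespace": "/".join(segments[:-1]) or None,
        "name": segments[-1],
        "version": unquote(version) if version else None,
    }


def purl_to_url(purl):
    """Return a browsable URL for a purl, or None for unsupported types."""
    parsed = parse_purl(purl)
    template = URL_TEMPLATES.get(parsed["type"])
    if template is None:
        return None
    return template.format(name=parsed["name"])
```

With this sketch, `purl_to_url("pkg:pypi/packageurl-python/")` yields the PyPI project URL given in the example above.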


VulnerableCode: return SPDX or CycloneDX report for VEX (vex: vulnerability exploitability)


VulnerableCode: Integration with other AboutCode tools

This includes integration with other AboutCode tools, namely ScanCode.io and ScanCode Toolkit. At a higher level, these tools detect all the packages used by a codebase. They would then query VulnerableCode and verify whether each of the found packages is vulnerable or not. See the ticket at ScanCode.io.


VulnerableCode: Add more data sources and mine the graph to find correlations between vulnerabilities

See https://github.com/nexB/vulnerablecode#how for background info. We want to search for more vulnerability data sources and consume them. There is a large number of pending tickets for data sources. See https://github.com/nexB/vulnerablecode/issues?q=is%3Aissue+is%3Aopen+label%3A"Data+collection"


VulnerableCode: Mining issues, CHANGELOG and commit logs for vulnerabilities:

The project would be to provide a way to effectively mine issues (such as GitHub issues) for possible unreported vulnerabilities. For a start, this should be focused on a few prominent repos. It should also find fix commits. This could use NLP and machine learning to "understand" vulnerability descriptions: security advisories often do not provide structured information on which package and package versions are vulnerable. We could create a system which would infer the vulnerable package name and version(s) by parsing the vulnerability description using natural language processing techniques and heuristics.
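
To give a flavor of the heuristic side of this idea, here is a minimal sketch that pulls version constraints out of a free-text advisory with regular expressions. The phrases it recognizes ("before", "prior to", "through", "up to and including") are only a small assumed sample of the wording found in advisories, and the function name `infer_version_constraints` is hypothetical.

```python
import re

# Dotted numeric versions such as 3.2.5; real advisories need more formats.
VERSION = r"\d+(?:\.\d+)+"

# Each pattern maps an English phrase to a comparator.
PATTERNS = [
    (re.compile(rf"\b(?:before|prior to)\s+(?:version\s+)?({VERSION})", re.I), "<"),
    (
        re.compile(
            rf"\b(?:through|up to and including)\s+(?:version\s+)?({VERSION})", re.I
        ),
        "<=",
    ),
]


def infer_version_constraints(description):
    """Return version constraints in the order they appear in the text."""
    constraints = []
    for pattern, comparator in PATTERNS:
        for match in pattern.finditer(description):
            constraints.append((match.start(), comparator + match.group(1)))
    return [constraint for _, constraint in sorted(constraints)]
```

Such rules are brittle, which is one reason the project suggests exploring NLP and machine learning as well.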


VulnerableCode: Vulnerability code scanners (e.g. static code analysis):

Create scanners which would verify whether a codebase is vulnerable to a vulnerability. Once we know that a vulnerable package is in use, a scanner could check whether the vulnerable code is called, or whether environmental conditions or configuration are conducive to the vulnerability, etc. This could be based on YARA rules, OpenVAS or similar, or based on Eclipse Steady and deeper code analysis, static or dynamic.


VulnerableCode: Exploits references as alternative to scanners.

We can collect exploits and PoCs that can verify whether a codebase is vulnerable to a given package vulnerability. Once we know that a vulnerable package is in use, the exploit could be used to check whether the vulnerable code is effectively being called.


VulnerableCode: Create a Vulnerability review Workbench

We could add UI components that would enable reviewers to triage, refine, improve and curate vulnerability data. This could include linking and displaying refe


VulnerableCode: Create a UI to browse and query vulnerabilities and vulnerable packages


FetchCode: Automatically fetch code archives given a purl





Univers: Validate that the univers library can handle all the versions and ranges of all the packages!

Project(s) in this domain would consist of building test suites for all the versions in univers and fixing the issues they uncover. Practically, this means downloading all the versions of all the packages of an ecosystem (for instance PyPI) and validating that we can compare the versions as well as the reference package management tool for this ecosystem. For instance in Alpine, see https://git.alpinelinux.org/apk-tools/tree/test/version.sh?h=v2.12.9

Some specific highlights would cover:

  • writing code that can collect the list of all the versions of all the packages in a given package ecosystem (for instance PyPI, npm, etc.). This code would likely live in FetchCode or in the ScanCode Toolkit packagedcode module. This could be extended to collect all the version ranges.

  • write an automated test harness to ensure that the univers library can properly parse (and unparse) all the versions and version ranges of all the packages.

  • write an automated test harness to ensure that the univers library can properly sort all the versions of each package in an ecosystem.

  • Update the univers library accordingly and create a unit test suite as needed
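
The sorting part of such a test harness can be sketched as follows: take a list of versions in the order given by the reference tool of an ecosystem, shuffle it, sort it with the key function under test and report any mismatch. `naive_version_key` is a hypothetical stand-in that only handles dotted numeric versions; it is not the univers API, and a real harness would plug in the univers version classes instead.

```python
import random


def naive_version_key(version):
    """Hypothetical stand-in sort key: dotted numeric versions only."""
    return tuple(int(part) for part in version.split("."))


def check_sort_order(reference_order, version_key, seed=0):
    """
    Shuffle the reference-ordered versions, sort them with version_key and
    return a list of (index, expected, actual) for each position that differs.
    An empty list means the key reproduces the reference order.
    """
    shuffled = list(reference_order)
    random.Random(seed).shuffle(shuffled)
    actual = sorted(shuffled, key=version_key)
    return [
        (index, expected, got)
        for index, (expected, got) in enumerate(zip(reference_order, actual))
        if expected != got
    ]
```

For example, comparing versions as plain strings puts "1.10" before "1.2", which this check reports as two mismatched positions.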


This project (or projects) would create vers spec implementations in many programming languages (that is, port the "univers" Python library to other languages).


CommonCode: Package name and version inference from a file name: get package name and version reliably

This project would provide a more reliable way to infer a package name and version from a package archive name. For instance, the simple case of "log4j-1.2.3.jar" could yield type:maven, name:log4j, version:1.2.3. The existing regex-based code in commoncode at https://github.com/nexB/commoncode/blob/main/src/commoncode/version.py is a bit complex to maintain. The project could possibly use some machine learning. In all cases, part of the project is to collect a test dataset of a large number of released archive names from various sources (sf.net, SWH, Debian, Fedora) to use as a test set (and possibly as a training set for ML).
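
As a baseline to compare against, here is a minimal regex-based sketch of the inference for simple cases such as the "log4j-1.2.3.jar" example. It is not the commoncode implementation: the function name, the extension list and the extension-to-type table are assumptions for illustration, and the heuristic fails on names that contain digit-leading segments.

```python
import re

ARCHIVE_EXTENSIONS = (".tar.gz", ".tar.bz2", ".tar.xz", ".tgz", ".zip", ".jar", ".gem")

# Assumed mapping from extension to package type, for illustration only.
TYPE_BY_EXTENSION = {".jar": "maven", ".gem": "gem"}

# A name, a "-" or "_" separator, then a version that starts with a digit.
NAME_VERSION = re.compile(
    r"^(?P<name>.+?)[-_](?P<version>\d[A-Za-z0-9]*(?:[-._][A-Za-z0-9]+)*)$"
)


def infer_name_version(filename):
    """Return a dict with the inferred type, name and version of an archive."""
    stem, package_type = filename, None
    for extension in ARCHIVE_EXTENSIONS:
        if filename.lower().endswith(extension):
            stem = filename[: -len(extension)]
            package_type = TYPE_BY_EXTENSION.get(extension)
            break
    match = NAME_VERSION.match(stem)
    if not match:
        # No version-like suffix found: treat the whole stem as the name.
        return {"type": package_type, "name": stem, "version": None}
    return {
        "type": package_type,
        "name": match.group("name"),
        "version": match.group("version"),
    }
```

A collected dataset of real archive names would show how often a simple rule like this is wrong, which is the motivation for a more reliable approach.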


In search of popularity and prominence metric for software packages https://github.com/nexB/aboutcode/wiki/Project-Ideas-Project-popularity
