
GSOC 2024 Project Ideas

Ayan Sinha Mahapatra edited this page Jan 22, 2024 · 8 revisions

See our page on applying for GSoC 2024: https://github.com/nexB/aboutcode/wiki/GSOC-2024



Here is a list of candidate project ideas for your consideration. Your own ideas are welcome too! Please chat with us about them to get early feedback.

Project Ideas Index

PURLdb:

Vulnerablecode:

scancode.io:

Other Project Ideas: https://github.com/nexB/aboutcode/wiki/Archived-GSoC-Project-Ideas


PURLdb project ideas


PURLdb: Add UI and deploy a live public server

Repository: https://github.com/nexB/purldb

Project code: https://github.com/nexB/purldb/tree/main/purldb

Size: Large

Difficulty Level: Intermediate

Tags: [Django], [PostgreSQL], [Javascript], [Web], [UI], [LiveServer]

Mentors:

  • @jyang
  • @tdruez
  • @pombredanne
  • @tg1999

Related Issues:

Description:

There are two tasks here:

  1. Add UI:

Add a basic Django UI for the project that supports querying by package, scanning, and matching. We would heavily reuse elements from scancode.io and vulnerablecode to give it the same look and feel.

  2. Deploy a public server similar to https://public.vulnerablecode.io/ as a demo with packageDB data. See https://github.com/nexB/vulnerablecode for reference.

PURLdb/ScanCode.io: Enrich an SBOM based on OSSF Security Score Card

Repository: https://github.com/nexB/purldb and https://github.com/nexB/scancode.io

Reference: https://github.com/ossf/scorecard

Size: Medium

Difficulty Level: Intermediate

Tags: [Django], [PostgreSQL], [SBOM], [Metadata], [Security]

Mentors:

  • @jyang
  • @tdruez
  • @pombredanne
  • @tg1999
  • @AyanSinhaMahapatra

Related Issues:

Description:

We already have SBOM export (and import) options in scancode.io supporting SPDX and CycloneDX SBOMs. We can enrich this data using the public data described at https://github.com/ossf/scorecard#public-data or the REST API at https://api.securityscorecards.dev/.

The specific tasks for this project are:

  • Research and figure out how best to consume this data
  • Map this data to SPDX/CycloneDX SBOM elements i.e. how it can be exported in a BOM
  • Use this in a pipeline in scancode.io AND/OR have this as an element in packageDB
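
As a starting point for the mapping task, here is a minimal sketch of turning scorecard results into CycloneDX-style `properties` (name/value pairs) on a component. The helper names, the `ossf:scorecard:*` property names, the URL path layout, and the shape of the sample data are all assumptions made for illustration; check the Scorecard API documentation and the SBOM specifications before relying on any of them.

```python
# Hypothetical helpers; only the API host comes from the text above.
SCORECARD_API = "https://api.securityscorecards.dev"


def scorecard_url(repo_host, owner, repo):
    """Build a per-repository URL (the path layout is an assumption)."""
    return f"{SCORECARD_API}/projects/{repo_host}/{owner}/{repo}"


def scorecard_to_properties(scorecard):
    """Flatten a scorecard-like dict into name/value property entries."""
    props = [{"name": "ossf:scorecard:score", "value": str(scorecard["score"])}]
    for check in scorecard.get("checks", []):
        props.append({
            "name": f"ossf:scorecard:check:{check['name']}",
            "value": str(check["score"]),
        })
    return props


# Sample data shaped like an aggregate score plus per-check scores (assumed shape).
sample = {
    "score": 7.5,
    "checks": [
        {"name": "Maintained", "score": 10},
        {"name": "Code-Review", "score": 6},
    ],
}

# Attach the flattened scores to a component entry of an SBOM.
component = {"type": "library", "name": "example", "version": "1.0.0"}
component["properties"] = scorecard_to_properties(sample)
```

Whether properties, annotations, or external references are the right place for this data in each SBOM format is part of the research this project asks for.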

PURLdb: Create relationships between source and binary packages

Repository: https://github.com/nexB/purldb

Project code: https://github.com/nexB/purldb/tree/main/matchcode

Size: Large

Difficulty Level: Advanced

Tags: [Django], [PostgreSQL], [BinaryAnalysis], [Metadata], [Security]

Mentors:

  • @pombredanne
  • @mjherzog
  • @chinyeungli

Description:

The proposed functionality is to match the files in the source code tree of a given package to the binary built from that package. The license, metadata, and conclusions obtained from scanning the source package can then be applied to the binary in order to analyze its license and attribution obligations.

An example could be a pipeline for end-to-end reverse engineering of Java application binaries. This project can also be realized similarly for the following ecosystems:

  • JavaScript
  • Android
  • iOS
  • Go

VulnerableCode project ideas

There are two main categories of projects for VulnerableCode:

  • A. COLLECTION: this category is about mining, collecting, or inferring new and improved data. This includes collecting from new data sources, inferring and improving existing data, or collecting new primary data (such as finding the fix commit of a vulnerability).

  • B. USAGE: this category is about using and consuming the vulnerability database and includes the API proper, the GUI, the integrations, and data sharing, feedback and curation.


VulnerableCode: Process unstructured data sources for vulnerabilities (Category A)

Repository: https://github.com/nexB/vulnerablecode

Reference: https://github.com/nexB/vulnerablecode/issues/251

Size: Large

Difficulty Level: Advanced

Tags: [Python], [Django], [PostgreSQL], [Security], [Vulnerability], [NLP]

Mentors:

  • @pombredanne
  • @tg1999
  • @keshav-space
  • @Hritik14
  • @AyanSinhaMahapatra

Related Issues:

Description:

The project would be to provide a way to effectively mine unstructured data sources for possible unreported vulnerabilities.

For a start this should be focused on a few prominent repos. This project could also find Fix Commits.

Some sources are:

  • mailing lists
  • changelogs
  • reflogs of commit
  • bug and issue trackers

This requires systems that "understand" vulnerability descriptions, because security advisories often do not provide structured information on which packages and package versions are vulnerable. The end goal is a system that infers the vulnerable package name and version(s) by parsing the vulnerability description using specialised techniques and heuristics.

One option is to use NLP/machine learning to automate it all, potentially training data masking algorithms to find these specific data (this would also involve creating a dataset), but that is going to be very difficult.

Another option is to start by crafting a curation queue and parsing as much as we can to make curation by humans easy, then progressively improve some small NLP models and classifiers to help automate more of the work.
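
As an illustration of the heuristic end of that spectrum, here is a minimal sketch that extracts candidate package names and version constraints from free text with two regular expressions. The patterns, the function name, and the output shape are assumptions chosen for this example; real advisories use many more phrasings, so the results are only candidates for a human curation queue.

```python
import re

VERSION = r"\d+(?:\.\d+)+"

# Two common phrasings: "<name> before 1.2.3" and "<name> 1.2.3 and earlier".
# Each pattern is paired with the constraint operator it implies.
PATTERNS = [
    (re.compile(
        rf"(?P<name>[A-Za-z][\w.-]*) (?:versions? )?(?:before|prior to) (?P<version>{VERSION})"
    ), "<"),
    (re.compile(
        rf"(?P<name>[A-Za-z][\w.-]*) (?P<version>{VERSION}) and earlier"
    ), "<="),
]


def extract_candidates(description):
    """Return candidate package/constraint pairs found in a description."""
    candidates = []
    for pattern, operator in PATTERNS:
        for match in pattern.finditer(description):
            candidates.append({
                "package": match["name"],
                "constraint": operator + match["version"],
            })
    return candidates


found = extract_candidates(
    "Buffer overflow in libfoo before 1.2.3 allows remote attackers to crash the service."
)
```

A curation queue could store the matched text span alongside each candidate so that a reviewer can confirm or reject it quickly.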


VulnerableCode: Add more data sources and mine the graph to find correlations between vulnerabilities (Category A)

Repository: https://github.com/nexB/vulnerablecode

Reference: https://github.com/nexB/vulnerablecode/issues?q=is%3Aissue+is%3Aopen+label%3A"Data+collection"

Size: Large

Difficulty Level: Intermediate

Tags: [Django], [PostgreSQL], [Security], [Vulnerability], [API], [Scraping]

Mentors:

  • @pombredanne
  • @tg1999
  • @keshav-space
  • @Hritik14
  • @jmhoran

Related Issues:

Description:

See https://github.com/nexB/vulnerablecode#how for background info. We want to search for more vulnerability data sources and consume them.

There is a large number of pending tickets for data sources. See https://github.com/nexB/vulnerablecode/issues?q=is%3Aissue+is%3Aopen+label%3A"Data+collection"

Also see tutorials for adding new importers and improvers:

More reference documentation in improvers and importers:

Note that this is similar to this GSoC 2022 project (a continuation):


VulnerableCode: On demand live evaluation of packages (Category A)

Repository: https://github.com/nexB/vulnerablecode

Size: Large

Difficulty Level: Intermediate

Tags: [Python], [Django], [PostgreSQL], [Security], [Web], [Vulnerability], [API]

Mentors:

  • @pombredanne
  • @tg1999
  • @keshav-space

Related Issues:

Description:

Currently vulnerablecode runs importers in bulk: all the data from advisories is imported and stored so that it can be displayed.

The objective of this project is to add another endpoint and API that would:

  • support querying a specific package by PURL
  • visit the advisories and package-ecosystem-specific vulnerability data sources and query them for this specific package
  • do this regardless of whether data for this package is already present in the DB (i.e. both for new packages and for refreshing old packages)
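
The flow described above can be sketched as a PURL parser plus a registry of per-ecosystem fetchers. Everything here is hypothetical: the parser handles only the `pkg:type/[namespace/]name[@version]` shape (a real implementation would more likely use the `packageurl-python` library), and the fetcher is a placeholder that performs no network access.

```python
def parse_purl(purl):
    """Tiny PURL parser: handles pkg:type/[namespace/]name[@version] only."""
    scheme, _, rest = purl.partition(":")
    if scheme != "pkg" or not rest:
        raise ValueError(f"not a PURL: {purl!r}")
    rest = rest.split("?", 1)[0].split("#", 1)[0]  # drop qualifiers and subpath
    path, _, version = rest.partition("@")
    type_, _, name_path = path.partition("/")
    namespace, _, name = name_path.rpartition("/")
    return {
        "type": type_,
        "namespace": namespace or None,
        "name": name,
        "version": version or None,
    }


# Registry mapping a PURL type to the function that queries its data sources.
LIVE_FETCHERS = {}


def live_fetcher(purl_type):
    """Decorator that registers a fetcher for one package ecosystem."""
    def register(func):
        LIVE_FETCHERS[purl_type] = func
        return func
    return register


@live_fetcher("pypi")
def fetch_pypi(package):
    # Placeholder: a real fetcher would query PyPI-specific advisory sources here.
    return [f"would query PyPI advisory sources for {package['name']}"]


def evaluate(purl):
    """Dispatch a single PURL to the fetcher for its ecosystem."""
    package = parse_purl(purl)
    fetcher = LIVE_FETCHERS.get(package["type"])
    if fetcher is None:
        raise LookupError(f"no live fetcher for type {package['type']!r}")
    return fetcher(package)
```

For example, `evaluate("pkg:pypi/django@4.2")` dispatches to `fetch_pypi`, while a PURL of an unregistered type raises `LookupError`.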

VulnerableCode: Implement new improvers (Category A)

Repository: https://github.com/nexB/vulnerablecode

Reference:

Size: Large

Difficulty Level: Intermediate

Tags: [Python], [Django], [PostgreSQL], [Security], [Web], [Vulnerability], [API]

Mentors:

  • @pombredanne
  • @tg1999
  • @keshav-space
  • @jmhoran

Related Issues:

Description:

One example is an improver that infers the affected ranges for advisory data sources that only give fixed versions.

Take the Alpine Linux importer, for example: we can only get the fixed versions from it. The aim of this project is to make a generic improver that infers the affected ranges with the help of the fixed versions.
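
A minimal sketch of the idea, under two loud assumptions: every known version older than the fixed version is treated as affected (which may over-report), and versions are plain dotted numbers. Both function names are hypothetical. Real ecosystems, Alpine included, need an ecosystem-aware version comparison rather than this naive key.

```python
def version_key(version):
    """Naive sort key: handles dotted numeric versions only (no suffixes such as -r1)."""
    return tuple(int(part) for part in version.split("."))


def infer_affected(known_versions, fixed_version):
    """Return the known versions that sort strictly before the fixed version."""
    fixed = version_key(fixed_version)
    return sorted(
        (v for v in known_versions if version_key(v) < fixed),
        key=version_key,
    )


affected = infer_affected(["2.0.0", "1.10.0", "1.2.0", "1.0.0"], "1.10.0")
```

Note that numeric comparison places `1.2.0` before `1.10.0`, which a plain string comparison would get wrong.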


VulnerableCode/Vulntotal: Browser Extension (Category B)

Repository: https://github.com/nexB/vulnerablecode

Reference: https://github.com/nexB/vulnerablecode/tree/main/vulntotal

Size: Medium

Difficulty Level: Intermediate

Tags: [Python], [Security], [Web], [Vulnerability], [BrowserExtension], [UI]

Mentors:

  • @keshav-space
  • @pombredanne
  • @tg1999

Related Issues:

Description:

Implement a Firefox/Chrome browser extension that runs vulntotal on the client side and queries the vulnerability data sources in order to compare them. The input will be a PURL, as in vulntotal.

  • research tools to run Python code in a browser (Brython/PyScript)
  • implement the browser extension to run vulntotal

ScanCode.io project ideas


ScanCode.io: Create GitHub SBOM creation action(s):

Repository: https://github.com/nexB/scancode.io

Reference: https://github.com/nexB/aboutcode/wiki/Project-Ideas-Create-GitHub-SBOM-action

Size: Large

Difficulty Level: Intermediate

Tags: [Python], [Django], [CI], [Security], [Vulnerability], [SBOM]

Mentors:

  • @pombredanne
  • @tdruez
  • @keshav-space
  • @tg1999
  • @AyanSinhaMahapatra

Related Issue:

Description:

Create a GitHub action using scancode.io:

  • use package/dependency/vulnerability data from scancode.io
  • output an SPDX/CycloneDX SBOM
  • upload this as an artifact created by the action (like artifacts created on tag push/release)

ScanCode Toolkit project ideas


Compute summary for all detected packages:

Repository: https://github.com/nexB/scancode-toolkit

Reference:

Size: Large/Medium

Difficulty Level: Advanced

Tags: [Python], [Summary], [Packages]

Mentors:

  • @pombredanne
  • @jyang
  • @AyanSinhaMahapatra

Related Issue:

Description:

Today the summary and license clarity scores are computed for the whole scan. Instead we should compute them for EACH package (and its files). This is possible now that we return which files belong to a package.
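
A minimal sketch of the grouping step, assuming each file entry in the scan results carries a `for_packages` list of package identifiers and a `detected_license_expression` string. Treat these field names, the function name, and the sample data as assumptions for illustration; the actual summary and clarity scoring involve much more than a tally.

```python
from collections import Counter, defaultdict


def license_tally_by_package(files):
    """Count detected license expressions per package instead of per scan."""
    tallies = defaultdict(Counter)
    for file in files:
        expression = file.get("detected_license_expression")
        if not expression:
            continue
        # A file may belong to more than one package; count it for each.
        for package_uid in file.get("for_packages", []):
            tallies[package_uid][expression] += 1
    return {uid: counter.most_common() for uid, counter in tallies.items()}


# Hypothetical file entries with made-up package identifiers.
files = [
    {"path": "a/setup.py", "for_packages": ["pkg:pypi/a@1.0"], "detected_license_expression": "mit"},
    {"path": "a/a.py", "for_packages": ["pkg:pypi/a@1.0"], "detected_license_expression": "mit"},
    {"path": "a/vendor.py", "for_packages": ["pkg:pypi/a@1.0"], "detected_license_expression": "apache-2.0"},
    {"path": "b/b.py", "for_packages": ["pkg:pypi/b@2.0"], "detected_license_expression": "gpl-2.0"},
    {"path": "README", "for_packages": [], "detected_license_expression": None},
]
tally = license_tally_by_package(files)
```

Files that belong to no package (such as `README` here) fall out of the per-package tallies and would still need a scan-level summary.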


Mark required phrases for rules automatically using NLP/AI:

Repository: https://github.com/nexB/scancode-toolkit

Size: Large/Medium

Difficulty Level: Advanced

Tags: [Python], [ML/AI], [Licenses]

Mentors:

  • @pombredanne
  • @jyang
  • @AyanSinhaMahapatra

Related Issue:

Description:

Mark required phrases in licenses automatically with ML/AI


Our Project ideas

Here are some project related attributes you need to keep in mind while looking into prospective project ideas, see also finding the right project guide:

Project Priority

  1. The repositories/projects are sorted in order of importance (i.e. PURLdb, vulnerablecode and scancode.io are the most important ones, followed by all other projects).

  2. The project ideas within a project are not sorted by priority.

  3. This doesn't mean we will always consider a project proposal with a higher priority idea over a relatively lower priority one, no matter the merit of the proposal. This is only one metric of selection, mostly to prioritize important projects.

  4. You can also suggest your own project ideas/discuss changes/updates/enhancements based on the provided ideas, but you need to really know what you are doing here and have lots of discussions with the maintainers.

Project Length

There are three project lengths:

  1. Small (~90 hours)
  2. Medium (~175 hours)
  3. Large (~350 hours)

If you are proposing an idea from this ideas list, it should match what is listed here, and additionally please have a discussion with the mentors about your proposed length and timeline.

We have marked our ideas as medium or large, but this is tentative and only a best guess. In a few cases a project is marked with both sizes because it can be scoped either way. Most of these ideas are on the larger side: these are large, complex projects, and if you are proposing a medium-length project you are likely underestimating the complexity (and how much we'll bug you to make sure everything is up to our standards). You must discuss your proposal and the project size you are proposing with a mentor, as otherwise we cannot consider your proposal fairly.

We would likely select only medium/large project ideas, as small projects are too short to get familiar with and contribute meaningfully to any of our projects.

Please also note that the stipend differs based on the project length you select.

Project Tags

Here are all the tags we use for specific projects. Feel free to search this page for them if you only want to look at projects that need a specific technical background.

[Django], [PostgreSQL], [Web], [DataStructures], [Scanning], [Javascript], [UI], [LiveServer], [API], [Metadata], [PackageManagers], [SBOM], [Security], [BinaryAnalysis], [Scraping], [NLP], [Social], [Communication], [Review], [Decentralized/Distributed], [Curation]

Project Difficulty Level

We generally use two levels of difficulty to characterize the projects:

  • Intermediate
  • Advanced

An Advanced project requires significant domain knowledge to tackle successfully. This domain knowledge is not a hard prerequisite before you start, but you must consult with mentors/maintainers early, ask a lot of domain-specific questions, and be ready to research and tackle greenfield projects if you choose a project in this difficulty category.

Most Intermediate projects do not require this much domain knowledge, and what they do require can easily be acquired during proposal writing/contributing if you are familiar with the tech stack used in the project.
