Networks are powerful tools that allow us to model and understand complex biological systems by representing entities such as genes, proteins, or patients as nodes, and their interactions or relationships as edges.
A graph
- directed,
- undirected, or
- weighted.
The adjacency matrix, symbolized as
-
Title: The Cancer Genome Atlas Lung Adenocarcinoma (TCGA-LUAD)
-
Main Focus: Study of lung adenocarcinoma (a common type of lung cancer)
-
Data Collected: Comprehensive genomic, epigenomic, transcriptomic, and proteomic data from lung adenocarcinoma samples
-
Disease Types:
- Acinar Cell Neoplasms
- Adenomas and Adenocarcinomas
- Cystic, Mucinous, and Serous Neoplasms
-
Number of Cases: 585 (498 with transcriptomic data)
-
Data Accessibility: Available on the NIH-GDC Data Portal
-
Link: TCGA-LUAD Project Page
Our main goal is to learn how to convert tabular data into network data in a way that is robust and reproducible. We will discuss different ways how to define the edges from the tabular data. We will also look at common techniques how to clean the resulting network all the while we are learning functionality of a popular Python package for working with network data, called NetworkX.
By the end of this session you will be comfortable to inspect, analyse and create you own network data for tasks such as protein-protein interaction (PPI), gene regulatory network (GRN) and patient-similarity network (PSN) analysis.
We will calculate correlation matrices from gene expression data, which will serve as the basis for constructing gene expression networks. The correlation matrix captures the relationships between genes by calculating the correlation coefficient between their expression profiles.
There are a few correlation metrics one could consider:
- Pearson
- O(n^2) complexity, fast for large datasets
- Spearman
- O(n^2 log n) complexity, relatively fast but can be slower than Pearson
- Absolute biweight midcorrelation
- Robust but slower than Pearson and Spearman, suitable for datasets with outliers
From the correlation matrix, we will construct a gene expression network where the nodes are genes and the edges represent the strength of the correlation between them. The network will allow us to identify highly connected genes, which are likely to be functionally related.
Networks generated from real-world data often contain noisy or irrelevant connections, which can impact downstream analysis. We will discuss various strategies for cleaning networks, such as removing low-confidence edges, filtering nodes based on their connectivity, and identifying and removing outlier nodes.
Sparsification is a technique used to reduce the density of a network by removing edges while preserving the overall network structure. We will explore different sparsification methods, such as thresholding based on edge weights or network density, to simplify the network and improve interpretability.
Highly connected nodes, also known as hubs, play a crucial role in network structure and function. We will discuss how to identify and characterise hubs in a network, including their biological relevance.
In the second half of this exercise we are going to generate two patient similarity networks. One of them is based on the gene expression data we used in Part 1. The other one is based on DNA methylation as a preliminary step for Section 2.