forked from RangerNJU/Static-Program-Analysis-Book
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
0 parents
commit 043cc96
Showing
21 changed files
with
787 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
# Node rules: | ||
## Grunt intermediate storage (http://gruntjs.com/creating-plugins#storing-task-files) | ||
.grunt | ||
|
||
## Dependency directory | ||
## Commenting this out is preferred by some people, see | ||
## https://docs.npmjs.com/misc/faq#should-i-check-my-node_modules-folder-into-git | ||
node_modules | ||
|
||
# Book build output | ||
_book | ||
|
||
# eBook build output | ||
*.epub | ||
*.mobi | ||
Large diffs are not rendered by default.
Oops, something went wrong.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
# Static-Program-Analysis-Book | ||
|
||
[Gitbook在线阅读地址](https://ranger-nju.gitbook.io/static-program-analysis-book/) | ||
|
||
[GitHub项目地址](https://github.com/RangerNJU/Static-Program-Analysis-Book) | ||
|
||
---- | ||
|
||
Getting started with static program analysis. Read this and start writing your first static program analyzer! | ||
|
||
静态程序分析入门。阅读此书并着手编写你的第一个静态程序分析器吧! | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
# Table of contents | ||
|
||
* [第零章-写在前面](src/ch0/README.md) | ||
* [静态程序分析是啥玩应啊](src/ch0/00-01-why-SPA.md) | ||
* [为什么是这本书?](src/ch0/00-02-why-this-book.md) | ||
* [来源与版权信息](src/ch0/00-03-sources-and-license.md) | ||
* [第一章-静态程序分析简介](src/ch1/README.md) | ||
* [初见——静态分析是什么?](src/ch1/01-01-whats-spa.md) | ||
* [中间表示——静态分析器的输入](src/ch1/01-02-ir.md) | ||
* [第二章-数据流分析的理论与应用](src/ch2/README.md) | ||
* [数据流分析](src/ch2/02-00-dataflow-analysis.md) | ||
* [到达定值分析](src/ch2/02-01-reaching-def-analysis.md) | ||
* [活跃变量分析](src/ch2/02-02-live-var-analysis.md) | ||
* [可用表达式分析](src/ch2/02-03-available-exp-analysis.md) | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
# Static-Program-Analysis-Book | ||
|
||
[Gitbook在线阅读地址](https://ranger-nju.gitbook.io/static-program-analysis-book/) | ||
|
||
[GitHub项目地址](https://github.com/RangerNJU/Static-Program-Analysis-Book) | ||
|
||
---- | ||
|
||
Getting started with static program analysis. Read this and start writing your first static program analyzer! | ||
|
||
静态程序分析入门。阅读此书并着手编写你的第一个静态程序分析器吧! |
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,35 @@ | ||
# 为什么你需要了解静态程序分析 | ||
|
||
## 定位 | ||
|
||
**静态程序分析**是**编程语言**中**应用**层面下的一个细分领域。 | ||
|
||
![](00-01-why-SPA.assets/PL.png) | ||
|
||
(TODO:举一些具体的例子) | ||
|
||
当今编程语言可以主要分为三大类 | ||
- 命令式(C、C++、JAVA) | ||
- 函数式(Scale、Haskell) | ||
- 逻辑式(Prolog、SQL) | ||
|
||
*之后的内容主要关注于针对命令式语言的分析。* | ||
|
||
当今**编程语言**这个分支下,面临这样一条恶龙:`数十年来语言的核心没有变化,但软件的规模和复杂性增长迅速,如何保证程序的可靠性?` | ||
|
||
## 应用 | ||
|
||
静态程序分析即是屠龙的宝刀之一,掌握并应用这一技术,能够: | ||
|
||
1. 提高程序可靠性——Null pointer dereference, memory leak, etc.(空指针引用与内存泄漏等) | ||
2. 提高程序安全性——Private information leak, injection attack, etc.(隐私信息泄漏与注入攻击等) | ||
3. 为编译优化提供基础技术——Dead code elimination, code motion, etc.(死代码消除和代码向循环外移动等) | ||
4. 有助于程序理解——IDE call hierarchy, type indication, etc.(为集成开发环境的功能提供帮助) | ||
|
||
## 市场 | ||
|
||
在学术界,静态程序分析技术几乎可以应用于所有关于程序的研究方向。 | ||
|
||
在工业界,国外的Google,IBM等大企业已经初步建立了自己的静态程序分析团队。国内的华为和阿里等企业也正在寻找静态程序分析方面的人才。 | ||
|
||
(TODO:添加更为详细的例子) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
# 为什么是这本书? | ||
|
||
> 你说的静态程序分析似乎有点儿用处,那么哪里可以学到呢? | ||
## 为什么应该读这本书? | ||
|
||
**1. 当前的中英文社区都缺乏这一领域的入门材料。** | ||
|
||
**2. 本书将带领读者,通过理论和实践的结合了解这一领域。** | ||
|
||
### 中文社区 | ||
|
||
在搜索引擎上搜索相关中文关键词,你会发现结果靠前的答案都是与某南的李老师相关的课程在B站上公开视频的笔记,其中有不少写得很好,**但并非面向一般学习者开发者的教程**。这两者有重要的区别: | ||
|
||
- 笔记:**面向自己**复习使用,只要自己回顾时能迅速pick up当时理解到的重点,就是一份好的笔记。 | ||
- 教程:**面向学习者**使用,一份好教程能让学习者迅速把握领域中的重点,并且为学习者的进一步应用打下基础。 | ||
|
||
### 英文社区 | ||
|
||
在搜索引擎上搜索英文关键词,你应该能搜索到国际上的大牛们的教材式的PDF文件和相关论文,或是开源的静态分析程序。但同样**缺乏教程**。大多数材料要么艰涩难懂要么太过粗浅。根据粗略的访问,我也了解到业界认为静态程序分析技术仍不成熟。 | ||
|
||
### 理论与实践的结合 | ||
|
||
本书将同时涉及理论和实践,这主要是受到了《The Rust Programming Language》的启发。 | ||
|
||
### 本书写作的目标 | ||
|
||
能让大多数有一定编程经历,已经修过本科计算机基础课程的大四及以上学力(不是学历)的同学: | ||
|
||
1. 在阅读本书时能较为轻松地理解理论 | ||
2. 能够自主完成原型实现 | ||
3. 能在阅读过程中接触CS不同领域的小知识 | ||
|
||
## 为什么要写这本书? | ||
|
||
- 最主要的动力还是老师现场授课时我感受到的passion | ||
- 这是一个少有人涉足的领域,写这方面的内容很符合我的性格 | ||
- 我从开源社区中获益颇多,找到了合适的机会也希望能为开源社区(尤其是中文社区)作出自己的贡献 | ||
- 人类发展的历史重要的两部分是传承和发展。领域先锋发展探索,亦需要有人将新的知识传播开来 |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
# 来源与版权信息 | ||
|
||
## 资料来源 | ||
|
||
本入门教程基于南京大学《软件分析》课程。 | ||
|
||
[PASCAL研究组主页](https://pascal-group.bitbucket.io/teaching.html) | ||
|
||
## 版权信息 | ||
|
||
教程文字部分遵循CC BY-NC-SA许可。 | ||
|
||
图片部分若无特殊说明,均为课程Slides的一部分,本书中使用这些图片已获Slides作者同意。 |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
# 写在前面 | ||
|
||
记录一些你在决定认真阅读本书之前需要了解的信息。 |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,107 @@ | ||
# 静态程序分析是什么 | ||
|
||
> 静态程序分析是指**不编译**出二进制代码通过测试用例对程序进行测试,仅通过**静态地**分析程序得到程序**不平凡**的性质的过程。 | ||
|
||
## 静态程序分析的抽象定义与诠释 | ||
|
||
### 一句话定义(TODO:中文翻译) | ||
|
||
> Static analysis analyzes a program P to reason about its behaviors and determines whether it satisfies some properties befo re running P. | ||
### 一句话诠释(TODO:中文翻译) | ||
|
||
> Ensure (or get close to) soundness, while making good trade-offs between analysis precision and analysis speed. | ||
在分析精度和速度之间做平衡的同时,保证(或近似)soundness。 | ||
|
||
## 静态程序分析的具体解释 | ||
|
||
静态程序分析技术中最重要的两个技术,分别是Abstraction(抽象)和Over-Approximation(过近似) | ||
|
||
### Abstraction | ||
|
||
**抽象是将值从Concrete Domain(具体域)映射到Abstract Domain(抽象域)的过程。** | ||
|
||
举个例子:(TODO:加图) | ||
|
||
Concrete Domain中,变量的值可以是具体的值,也可能是某种表达式或函数的返回值。 | ||
|
||
Abstract Domain中,变量的值分为五类: | ||
|
||
- 正 | ||
- 负 | ||
- 零 | ||
- unknown(未知):根据表达式或函数返回决定,程序运行时会有具体的正负零数值,但运行前只通过该表达式无法确定。如应用C语言中的三目运算符(TODO:加Link)`x = flag ? 1 : -1`中x的值就是unknown的。 | ||
- undefined(未定):程序运行时会遇到错误,并产生未定义行为(TODO:加Link)。如许多语言中divided by zero(除数为零)通常会触发异常/硬件错误,此时如`a=b/0`中a的值就是undefined的。 | ||
|
||
其中unknown通常表达为正的T,读作top;undefined通常表达为上下颠倒的T,读作bottom。 | ||
|
||
### Over-approximation | ||
|
||
**过近似主要是指对抽象值进行操作时的思想。** | ||
|
||
继续上面的例子,具体来说操作可以分为两类:Data flow(数据流)和Control flow(控制流): | ||
|
||
#### Data flow | ||
|
||
(TODO-加图) | ||
|
||
- 两个正数相加为正数 | ||
- ... | ||
- 正数和负数相加,结果为unknown/top。**这是因为运行前只知道表达式的正负号,无法确定结果是正负零中的任何一个,但程序运行时会有具体的数值。** | ||
- unknown/top除以0,结果为undefined。**这是因为会触发未定义行为。** | ||
|
||
#### Control flow | ||
|
||
在程序执行过程中往往会有分支结构,如果x的值在两个分支中分别被赋抽象正值和抽象负值,那么合并时x抽象值应该是五类中的哪一种呢? | ||
|
||
对于分支合并时的x,**运行前只知道抽象值的正负号,无法确定结果是哪一种,但程序运行时会有具体的数值**。无论是抽象为正还是负都无法准确描述x合并后的状态,所以x合并后的抽象值是unknown/top。 | ||
|
||
|
||
## Rice‘s Theorem与静态程序分析目标 | ||
|
||
### Rice's Theorem | ||
|
||
> Any non-trivial property of the behavior of programs in a r.e. language is undecidable. | ||
(TODO:加中文) | ||
|
||
(TODO:加中文) | ||
这句话中有很多很难的词汇,接下来一一解释: | ||
- non-trivial properties ~= interesting properties ~= the properties related with run-time behaviors of programs | ||
- r.e. (recursively enumerable) 递归可枚举语⾔: recognizable by a Turing-machine | ||
- There is no such approach to determine whether P satisfies such non-trivial properties, i.e., giving exact answer: Yes or No | ||
- 故不存在 perfect (sound & complete) static analysis | ||
|
||
### 两个重要概念:Sound与Complete | ||
|
||
> 思考: 作为一个开发者,你使用静态分析器分析自己的代码时,哪种情况更让你觉得糟糕? | ||
> | ||
> - 静态分析器**没有分析出代码错误**的部分。 | ||
> - 静态分析器**判断代码正确的部分为错**的部分。 | ||
Over- and under-approximations are both for safety of analysis. | ||
|
||
sound: 报出所有问题 may analysis: outputs information that may be true (over-approximation) (safe=over) | ||
|
||
complete: 报出的问题都是对的 must analysis: outputs information that must be true (under-approximation) (safe=under) | ||
|
||
|
||
### 实际应用中静态分析器的设计目标 | ||
|
||
(TODO:加图) | ||
**实际应用中往往没有办法做到完美。因而需要妥协。** | ||
|
||
需要静态分析器能在可接受的时间内给出精度满足要求的解。为此: | ||
|
||
妥协 soundness (false negatives 可能漏报) | ||
|
||
妥协 completeness (false positives 可能误报) (⼤多数情况的分析,因为 soundness 很重要) | ||
|
||
## 再讲五块钱的? | ||
|
||
### 关于未定义行为(TODO:加Link和解释) | ||
|
||
### 关于程序测试与分析(TODO:加程序分析的LINK和对比) | ||
|
||
### 关于判定问题中经常用到的术语 |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,51 @@ | ||
# 中间表示——静态分析的输入 | ||
|
||
## 从编译器的组件谈起 | ||
|
||
一个典型的编译器分成一下几个部分: | ||
- Scanner:读入源代码,借助正则表达式完成词法分析,输出Tokens或报告错误的词法输入。如`goouojd`并是一个有效的英文单词,就会被报告为错误。 | ||
- Parser:读入Tokens,通过上下文无关文法完成语法分析,构建抽象语法树或报告错误的语法输入。如`Like your hair I`就不符合英语语法,会被报告为错误。 | ||
- Type Checker:读入抽象语法树,以属性文法进行语义分析并检查类型兼容性等,输出Decorated AST或报告错误的语义。如`Apples eat you`在语法上正确,但不可能有苹果会吃人,这是语义上的错误。 | ||
- Translater:读入Decorated AST,通过遍历Decorated AST将树型IR(AST)翻译为线型IR(通常是三地址码的形式)。然后可以**通过静态分析**进行机器无关的代码优化。 | ||
- Code Generator:读入线型IR,并根据指定的CPU指令集将机器无法直接执行的IR转换为机器可直接执行的机器代码。 | ||
|
||
### AST与IR的比较 | ||
|
||
AST: | ||
- high-level and closed to grammar structure | ||
- usually language dependent | ||
- suitable for fast type checking | ||
- lack of control flow information | ||
|
||
即: | ||
- 更接近于语法规定的结构 | ||
- 通常与语言有关 | ||
- 适于快速类型检查 | ||
- 没有控制流信息 | ||
|
||
IR: | ||
- low-level and closed to machine code | ||
- usually language independent | ||
- compact and uniform | ||
- contains control flow info | ||
- usually considered as the basis for static analysis | ||
|
||
即: | ||
- 更接近于机器码 | ||
- 通常与语言无关 | ||
- 紧凑而通用 | ||
- 包含控制流信息 | ||
- 通常被视为静态分析的基础 | ||
|
||
**总的来说,线型IR更适合静态分析。线型IR没有固定的定义,实际使用中常用三地址码。** | ||
|
||
~~在计算机领域,直接用英文常常更容易将概念表达清楚,如果有读者认为有更好的翻译,可以联系作者(邮箱或github issue都可以)~~ | ||
|
||
### 三地址码 | ||
|
||
> Definition: Each 3AC contains at most 3 addresses (name, constant, temporary) | ||
定义:每一条三地址码最多包含3个地址(地址包括程序员显式定义的有名字变量,常量和临时变量) | ||
一些形式 | ||
|
||
TBD |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# 入门指南 | ||
|
||
在这一部分中,将介绍什么是静态程序分析(下文将简称为静态分析),这一技术有什么样的应用,为什么它值得我们去学习与研究。 | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# 数据流分析 | ||
|
||
TBD | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# 数据流分析之到达定值分析 | ||
|
||
TBD | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# 数据流分析之活跃变量分析 | ||
|
||
TBD | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
# 数据流分析之可用表达式分析 | ||
|
||
## 定义 | ||
|
||
> An expression x op y is available at program point p if (1) all paths from the entry to p must pass through the evaluation of x op y, and (2) after the last evaluation of x op y, there is no redefinition of x or y. | ||
## 理解定义:一个tricky的例子 | ||
|
||
1. 按照直觉来说,c处不是,但是按照定义来说是。但是实际上运行时只会走一条分支。 | ||
|
||
2. 回顾: | ||
must->under-approximation->报出来的全都是真的。不允许误报,但允许漏报 | ||
may->不允许漏报,但允许误报 | ||
重新理解这个例子。 | ||
|
||
PS. 如何记住边界条件设置谁?设置是空还是满?对余下所有块的初始化是空还是满(一个巧妙的方法是判断may/must后,如果条件汇合时是交集,一般开始时大多数块初始化为满。) | ||
|
||
一个经常迷惑的问题:先gen先kill?根据定义就解决所有问题2 |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
# 数据流分析的理论与应用 |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# 数据流分析例一——Reaching Definition | ||
|
||
TBD | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
# 可用表达式分析 | ||
|
||
## 定义 | ||
|
||
> An expression x op y is available at program point p if (1) all paths from the entry to p must pass through the evaluation of x op y, and (2) after the last evaluation of x op y, there is no redefinition of x or y. | ||
## 理解定义:一个tricky的例子 | ||
|
||
1. 按照直觉来说,c处不是,但是按照定义来说是。但是实际上运行时只会走一条分支。 | ||
|
||
2. 回顾: | ||
must->under-approximation->报出来的全都是真的。不允许误报,但允许漏报 | ||
may->不允许漏报,但允许误报 | ||
重新理解这个例子。 | ||
|
||
PS. 如何记住边界条件设置谁?设置是空还是满?对余下所有块的初始化是空还是满(一个巧妙的方法是判断may/must后,如果条件汇合时是交集,一般开始时大多数块初始化为满。) | ||
|
||
一个经常迷惑的问题:先gen先kill?根据定义就解决所有问题2 |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
# 数据流分析的理论与应用 |