Q&A with Lai Jianxin on Static Code Analysis
Lai Jianxin is Xcalibyte’s Head of Research & Development for their static code analysis tool, Xcalscan.
What are your responsibilities?
I lead the core R&D team developing a next-generation SAST tool. The analysis engine is the core component of this tool. It first converts source code to an intermediate representation and then performs flow-sensitive, inter-procedural, context-sensitive and object-sensitive analysis. Our static code analyzer is built on top of these analyses and combines them with symbolic execution and formal verification. It detects defects, security vulnerabilities and any code that violates coding conventions, standards or user-defined rules.
What is static code analysis?
Static code analysis is the analysis of a program’s semantics and behaviour without actually executing it. It lets us find abnormal semantics or undefined behaviour caused by poor-quality or incorrect code. In practice, static code analysis allows errors to be found while the code is still being written. You do not need to wait for the whole program to be completed, nor do you need to build it in a runtime environment and write test cases. Because it detects defects early in the software development process, it improves development efficiency and helps deliver high-quality code with fewer defects.
What is the biggest problem with SAST tools today?
Modern software systems keep growing in scale. Code bases have grown from tens of thousands of lines to tens of millions. Computer systems have become more complicated, moving from traditional single-processor systems to distributed ones and from homogeneous to heterogeneous computing. In addition, software development has evolved from the use of a single language to collaborative development in multiple languages. These changes all pose significant challenges to SAST tools, which must be able to handle code in multiple languages and the interoperation between them. For example, to detect vulnerabilities in Android applications, the tool must not only analyse C/C++ and Java code but also support the Java Native Interface (JNI), so that it can detect problems caused by the interoperation of byte code and native code.
There are other indicators for judging an excellent SAST tool. The first is the rate of false negatives and false positives. A high rate of either significantly reduces the effectiveness of a SAST tool, which is then unlikely to improve development efficiency or software quality. The second is how easily the detection rules can be extended and customized to user-specific requirements. In addition to supporting common industry secure coding standards such as CERT, SAST tools should support user-defined coding conventions and standards as well as business logic rules. The third is the time and resources that analysis requires. If a SAST tool takes too long for one scan or uses too much memory, it is difficult to integrate into the programmer’s daily development process, and it will not improve development efficiency or software quality.
In compiler technology, what is the Abstract Syntax Tree (AST)?
An abstract syntax tree is a tree-like representation of the structure of a program’s source code. The source code is first processed by a lexical analyzer (lexer) to obtain tokens of different kinds, and the tokens are then analysed by a parser to obtain the Abstract Syntax Tree (AST). The AST represents the entire program and consists of a root node, intermediate nodes for abstract syntax structures, and terminal nodes. The key value of the AST is that its nodes correspond to the syntactic elements of the input source code. For the C language source code example in Figure 1, the corresponding AST is shown in Figure 2.
Figure 1: C code for while loop
Figure 2: AST of the while loop
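As an illustration of the root, intermediate and terminal nodes described above, here is a minimal sketch in C of how a small loop might be stored as an AST. The loop `while (i < 10) i = i + 1;`, the `Node` type and the node kind names are assumptions made for this example. They are not the code from Figure 1 and not Xcalscan’s internal data structures.

```c
#include <stddef.h>

/* Hypothetical node kinds for a tiny C-like AST. */
typedef enum { N_WHILE, N_LESS, N_ASSIGN, N_ADD, N_IDENT, N_CONST } NodeKind;

typedef struct Node {
    NodeKind kind;
    const char *text;          /* identifier or literal text; NULL for inner nodes */
    const struct Node *left;   /* first child, or NULL */
    const struct Node *right;  /* second child, or NULL */
} Node;

/* AST for:  while (i < 10) i = i + 1;   built by hand, leaves first. */
static const Node cond_i  = { N_IDENT,  "i",  NULL,    NULL };
static const Node cond_10 = { N_CONST,  "10", NULL,    NULL };
static const Node cond    = { N_LESS,   NULL, &cond_i, &cond_10 };
static const Node rhs_i   = { N_IDENT,  "i",  NULL,    NULL };
static const Node rhs_1   = { N_CONST,  "1",  NULL,    NULL };
static const Node rhs     = { N_ADD,    NULL, &rhs_i,  &rhs_1 };
static const Node lhs_i   = { N_IDENT,  "i",  NULL,    NULL };
static const Node body    = { N_ASSIGN, NULL, &lhs_i,  &rhs };
const Node while_loop     = { N_WHILE,  NULL, &cond,   &body };  /* root node */

/* Count every node in the tree rooted at n. */
int count_nodes(const Node *n) {
    if (n == NULL) return 0;
    return 1 + count_nodes(n->left) + count_nodes(n->right);
}

/* Count terminal nodes (identifiers and constants). */
int count_terminals(const Node *n) {
    if (n == NULL) return 0;
    if (n->left == NULL && n->right == NULL) return 1;
    return count_terminals(n->left) + count_terminals(n->right);
}
```

Calling `count_nodes(&while_loop)` walks the whole tree from the root. Note that the tree records only how the source text nests; nothing in it says that control returns from the body to the condition.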
What is Intermediate Representation (IR)?
IR is the core element of a compilation system or a static analysis system. It is the internal representation of the source program inside the compiler or static analyzer, and all code analysis, optimization and transformation are performed on it. Generally speaking, the IR is produced from the AST after type checking and canonicalization. A compiler analyses and optimizes the program at the IR level and then converts the IR into assembly or object code for the target processor. A static analysis tool performs in-depth analysis of the program’s semantics and undefined behaviour on the IR, and combines this with predefined or user-defined rules to detect vulnerabilities or defects in the source code. In modern compilers and static analysis tools, a Control Flow Graph (CFG) is usually used to represent the program’s control logic, and Static Single Assignment (SSA) form represents the factored use-def chains. Neither exists in the AST. For the C language source code in Figure 1, the corresponding IR is shown in Figure 3.
Figure 3: IR of the while loop
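To make the CFG and SSA ideas concrete, the following is a minimal sketch in C, not Xcalscan’s actual IR. It assumes a hypothetical program `i = 0; while (i < n) i = i + 1; return i;`, splits it into four basic blocks, and stores SSA-style text for each block purely as illustration. The block labels, the `phi` notation in the strings and all type and function names are assumptions.

```c
#include <stddef.h>

#define MAX_SUCCS 2

/* One basic block: straight-line code plus up to two successor edges. */
typedef struct {
    const char *label;
    const char *code;        /* SSA-style text, for illustration only */
    int succs[MAX_SUCCS];    /* indices of successor blocks, -1 = none */
} BasicBlock;

/* Hypothetical CFG for:  i = 0; while (i < n) i = i + 1; return i; */
static const BasicBlock cfg[] = {
    { "entry",  "i1 = 0",                        {  1, -1 } },
    { "header", "i2 = phi(i1, i3); t1 = i2 < n", {  2,  3 } },
    { "body",   "i3 = i2 + 1",                   {  1, -1 } },
    { "exit",   "return i2",                     { -1, -1 } },
};
enum { NUM_BLOCKS = sizeof cfg / sizeof cfg[0] };

/* Depth-first search. An edge into a block that is still on the
   DFS stack (state 1) is a back edge, which means the CFG has a loop. */
static int dfs_finds_back_edge(int block, int state[]) {
    state[block] = 1;                       /* on the DFS stack */
    for (int k = 0; k < MAX_SUCCS; k++) {
        int s = cfg[block].succs[k];
        if (s < 0) continue;
        if (state[s] == 1) return 1;        /* back edge found */
        if (state[s] == 0 && dfs_finds_back_edge(s, state)) return 1;
    }
    state[block] = 2;                       /* fully explored */
    return 0;
}

/* Returns 1 if the CFG reachable from the entry block contains a loop. */
int cfg_has_loop(void) {
    int state[NUM_BLOCKS] = { 0 };
    return dfs_finds_back_edge(0, state);
}
```

The edge from `body` back to `header` is explicit in this structure, and the `phi` in `header` records that `i2` may come from either `i1` or `i3`. Both facts are implicit at best in a syntax tree.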
What are the differences between AST and IR?
As mentioned, the AST can be converted into IR after type checking and canonicalization. The AST is suitable for some code specification checks, such as checks on naming conventions or coding idioms, which are usually performed by graph pattern matching. The IR supports deeper flow-sensitive, inter-procedural, context-sensitive and object-sensitive analysis and therefore more precise vulnerability checks. Compared with IR, the AST has two obvious disadvantages. First, as a tree representation of the input source code, it lacks a good way to express control flow and data flow. Second, the AST is not canonicalized: if the same semantic structure is written in different ways, its representations in the AST also differ. For example, in the C language, the same loop written with ‘for’, ‘while’ and ‘if/goto’ yields three different ASTs, while the control flow graphs generated after conversion to IR are the same. Canonicalization makes analysis of program semantics easier, which leads to greater accuracy in defect detection.
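The point about ‘for’, ‘while’ and ‘if/goto’ can be seen in a small C sketch. The three hypothetical functions below (the names are chosen for this example) compute the same sum with three different syntactic forms. Each would give a differently shaped AST, yet all three describe the same loop: initialize, test, body, increment, jump back to the test.

```c
/* Sum 0 + 1 + ... + (n - 1) using a 'for' loop. */
int sum_for(int n) {
    int s = 0;
    for (int i = 0; i < n; i++) {
        s += i;
    }
    return s;
}

/* The same computation using a 'while' loop. */
int sum_while(int n) {
    int s = 0;
    int i = 0;
    while (i < n) {
        s += i;
        i++;
    }
    return s;
}

/* The same computation using 'if' and 'goto'. */
int sum_goto(int n) {
    int s = 0;
    int i = 0;
loop:
    if (i < n) {
        s += i;
        i++;
        goto loop;
    }
    return s;
}
```

An AST-based check for a loop-related rule would need a separate pattern for each of the three forms, whereas a check written against the control flow graph would only need to handle one shape.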
What other benefits does analysis at the IR level have?
One of the restrictions of the AST is that it is usually language dependent. For example, C programs have a corresponding C AST and Java programs have a corresponding Java AST. The IR, by contrast, is generally independent of the source language: C source code, Java source code and source code in other languages can all be converted to a common IR. If we use a unified analysis and detection engine on the IR and combine it with language-specific rules, we can detect defects in programs written in various languages. Another benefit is that the IR is more stable than the AST. For example, a new C++ standard is introduced every three years and brings new syntax structures, which means that new AST structures need to be created every three years. If the analysis engine is based on the AST, it also needs to be updated every three years to process these new nodes. If the analysis engine is based on the IR, only the new AST nodes need to be converted into the existing IR structures, which leaves the complex analysis engine unaffected.
In this regard, how does Xcalscan differentiate itself from other SAST tools?
The advantages of Xcalscan can be seen from the following three aspects:
Firstly, it has an innovative and scalable flow-sensitive, object-sensitive and context-sensitive analysis engine. Flow-sensitive means that the analysis engine distinguishes the definitions and uses of program variables along different execution paths, and reports warnings only on the execution paths that cause errors. Object-sensitive means that the analysis engine can distinguish between different objects, or between different members of the same object, and reports warnings only on the objects or members that cause errors. Context-sensitive means that the analysis engine can distinguish the contexts of the same function at different call sites, and reports errors only at the call sites that cause them. The Xcalscan analysis engine achieves flow, object and context sensitivity by combining static single assignment (SSA), virtual variables (Virtual Symbol), SSA-based alias analysis and inter-procedural analysis. Xcalscan can report where the source of each problem lies and which path the data takes as it is assigned between statements and passed across function boundaries. Xcalscan’s graphical user interface uses a flow graph to show how a problem is introduced at its source, how it propagates step by step, and where it is eventually triggered.
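The three kinds of sensitivity can be illustrated with a short C sketch. The functions and the `Record` type below are hypothetical and written only for this example, and the comments describe what a precise analyzer would be expected to report according to the definitions above. They are not actual Xcalscan output.

```c
#include <stddef.h>

typedef struct {
    int *data;   /* set to a valid address in object_example */
    int *meta;   /* left NULL in object_example */
} Record;

/* Dereferences its argument; safe only when the caller passes a valid pointer. */
static int read_value(const int *p) {
    return *p;
}

/* Flow sensitivity: p is NULL only on the path where use_local == 0. */
int flow_example(int use_local) {
    int local = 7;
    int *p = NULL;
    if (use_local) {
        p = &local;
    }
    return *p;   /* a warning belongs only on the use_local == 0 path */
}

/* Context sensitivity: the same callee, two call sites, one of them faulty. */
int context_ok(void) {
    int x = 3;
    return read_value(&x);     /* valid argument: no warning expected */
}

int context_bad(void) {
    return read_value(NULL);   /* NULL argument: warning expected at this call site */
}

/* Object sensitivity: two members of one object, only one of them NULL. */
int object_example(int read_meta) {
    int x = 5;
    Record r = { &x, NULL };
    return read_meta ? *r.meta : *r.data;   /* only r.meta is a problem */
}
```

An analysis without these sensitivities would have to either flag every dereference in the sketch, producing false positives, or flag none, producing false negatives. Only `flow_example(1)`, `context_ok()` and `object_example(0)` are safe to run; the other paths dereference NULL on purpose.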
Secondly, the vulnerability detection engine integrates multiple methods, such as the analysis of variables and virtual variables, data flow analysis and symbolic execution, to analyse the program semantics against a set of predefined coding conventions, standards or rules.
Finally, it has an extensible user-defined rule engine. Xcalscan defines and exposes analysis and rule-checking APIs for developing user-defined rules. By calling these APIs, users can specify the attributes and side effects of their own code or of third-party library functions, their pre- and post-conditions, and the rules to check. The Xcalscan rule engine automatically reads the user rules and annotates the intermediate representation with them. When the static analysis is performed, the rule engine determines whether the pre- and post-conditions are met and reports any violations.
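The idea of checking user-specified preconditions can be sketched in C as follows. This is a toy model and not the Xcalscan rule API, which is not shown in this interview. The `NonNullRule` and `CallFact` types, the function `find_violations` and the library function names `lib_write` and `lib_read` are all hypothetical. The sketch assumes the analysis has already derived, for each call site, whether an argument may be NULL.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical precondition rule: argument arg_index of callee must not be NULL. */
typedef struct {
    const char *callee;
    int arg_index;
} NonNullRule;

/* Hypothetical fact that an analyzer has derived for one call site. */
typedef struct {
    const char *callee;
    int line;
    int arg_may_be_null[2];   /* 1 if the argument may be NULL on some path */
} CallFact;

/* Check every call site against every rule. Stores the source lines of up to
   max_out violations in out_lines and returns the total number found. */
size_t find_violations(const NonNullRule *rules, size_t nrules,
                       const CallFact *facts, size_t nfacts,
                       int out_lines[], size_t max_out) {
    size_t found = 0;
    for (size_t f = 0; f < nfacts; f++) {
        for (size_t r = 0; r < nrules; r++) {
            if (strcmp(facts[f].callee, rules[r].callee) == 0 &&
                facts[f].arg_may_be_null[rules[r].arg_index]) {
                if (found < max_out) out_lines[found] = facts[f].line;
                found++;
            }
        }
    }
    return found;
}

/* Sample data: one user rule and three call sites. */
const NonNullRule sample_rules[] = { { "lib_write", 0 } };
const CallFact sample_facts[] = {
    { "lib_write", 10, { 0, 0 } },   /* argument 0 is never NULL: satisfies the rule */
    { "lib_write", 20, { 1, 0 } },   /* argument 0 may be NULL: violates the rule */
    { "lib_read",  30, { 1, 0 } },   /* no rule is defined for lib_read */
};
```

With the sample data, only the call on line 20 is reported. Keeping rules as data separate from the checking loop reflects the design described above: new rules can be added without changing the analysis itself.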
Lai Jianxin has a background in compiler optimisation and advanced static program analysis. After graduating from Tsinghua University with a Master’s Degree in Computer Science in 2006, he joined Hewlett-Packard’s compiler team, where he served as a compiler engineer, compiler back-end architect and project manager and worked on the open source compiler Open64, the HP-UX product compiler aCC and the HP Non-Stop product compiler. He joined Xcalibyte in 2018.