Most technologists agree that Moore’s Law, the observation that the number of transistors on a chip doubles roughly every two years, has come to an end. Projections for increased computing power now focus on other areas, including compiler technologies. We invited Jianxin Lai, Xcalibyte’s R&D Director, to explain how compilers work.
01 What is Compiler Technology?
Compiler technology was created alongside high-level programming languages. In its most basic form, it refers to the translation of a high-level language into machine code. Without a compiler, programmers could only write programs in machine code, composed of 1’s and 0’s. In today’s world, writing in machine code is practically impossible for real tasks and projects. Compiler technology is crucial in software development: it spans the software stack from top to bottom and provides the means to drive the hardware.
In addition to commonly used ‘high-level language to machine code’ compilers, compiler technology is also widely used in:
- source-to-source compilers from Domain Specific Languages (DSLs) to general programming languages, such as smart contract compilers in the blockchain world and the compilers web developers use to compile other high-level languages into JavaScript;
- high-level language interpreters, e.g. Python;
- virtual machines, e.g. the Java VM or a WebAssembly VM;
- binary translators, e.g. Rosetta 2 on Apple Silicon M1.
As the scale of modern software applications continues to grow, the technology stack becomes increasingly complex. Only by exploiting the full capabilities of compiler technology can integrity, security, performance, readability, maintainability, portability, and developer productivity be kept at the levels modern software applications demand.
02 As a developer new to compiler technology, what kind of knowledge do you need?
Familiarity with the C and C++ languages is needed, as well as with common data structures, especially the construction, search, traversal, and transformation algorithms for trees and graphs. Advanced compiler development additionally requires a deep understanding of the source language, the target language or instruction set, and the micro-architecture of the processor.
03 What are the key challenges for modern compilers in today’s world?
Since the invention of FORTRAN, compiler technology has continued to advance with the development of new high-level languages and computer technology. Early compilers focused on generating machine code from high-level languages; the main problems they solved were instruction selection and register allocation. As high-level languages have raised their level of abstraction and modern processors and storage systems have grown more complex, compiler optimization has become increasingly important. Compiler optimization techniques can generally be divided into machine-independent optimization and machine-specific optimization. The former helps eliminate the extra overhead introduced by high-level language abstraction, improving the programmer’s development efficiency and increasing the readability, maintainability, and portability of the program. The latter helps the program make full use of the resources and characteristics of the modern processor and memory hierarchy to improve overall performance.
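To make the distinction concrete, here is a minimal sketch (the function and variable names are illustrative, not from the interview) of a classic machine-independent optimization, loop-invariant code motion, written out as a source-level before/after:

```c
#include <stddef.h>

/* Before: scale * 4 is recomputed on every iteration even though
 * neither operand changes inside the loop. */
void scale_before(int *a, size_t n, int scale) {
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] * (scale * 4);
}

/* After loop-invariant code motion: the invariant expression is
 * hoisted out of the loop, regardless of the target processor. */
void scale_after(int *a, size_t n, int scale) {
    int k = scale * 4;              /* hoisted invariant */
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] * k;
}
```

A machine-specific pass might afterwards replace the multiplication with shifts and adds, but only if that is actually cheaper on the target processor.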
With the advent of open-source software and cloud computing, software development has gradually shifted from closed, in-house development to open source-based software assembly and integration: from a stand-alone environment using a single programming language, to a client/server model using two languages, and then to multilingual development across complex cloud-edge-device environments. This shift has brought new challenges for compilers. For example, how can the compiler itself be better optimized to support the construction of large-scale systems such as Android or Chromium? Beyond code generation, how can compiler technology be applied to the development process to improve programmer productivity and solve code quality problems? In mixed-language development or third-party library integration, how can compiler technology prevent bugs that cross language or library boundaries? These are the key challenges facing today’s compilers and will be looked at in this article.
04 Which part of the compiler is the most important?
First, let’s take a look at the overall architecture of the compiler in the diagram below.
The front-end of the compiler is responsible for converting the source language into an intermediate representation (IR); this includes lexical analysis, syntax analysis, syntax checking, and IR generation. Lexical analysis and syntax analysis are based on grammar and automata theory. Syntax checking traverses the Abstract Syntax Tree (AST) produced by syntax analysis and checks whether the nodes and their hierarchical relationships conform to the source language specification. IR generation then simplifies and normalizes the AST and emits the intermediate representation defined by the compiler.
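As an illustration, here is a minimal sketch (the node kinds are simplified and the names are assumptions) of the kind of AST a front-end builds for an expression such as a + b * 2, together with a traversal that emits a naive stack-style IR:

```c
#include <stdio.h>

/* Simplified AST node kinds for expressions. */
typedef enum { NODE_NUM, NODE_VAR, NODE_ADD, NODE_MUL } NodeKind;

typedef struct Node {
    NodeKind kind;
    int value;            /* for NODE_NUM */
    const char *name;     /* for NODE_VAR */
    struct Node *lhs, *rhs;
} Node;

/* Post-order traversal: children first, then the operator, which is
 * exactly the order a stack machine evaluates in. */
static void emit(const Node *n) {
    switch (n->kind) {
    case NODE_NUM: printf("push %d\n", n->value); break;
    case NODE_VAR: printf("load %s\n", n->name);  break;
    case NODE_ADD: emit(n->lhs); emit(n->rhs); printf("add\n"); break;
    case NODE_MUL: emit(n->lhs); emit(n->rhs); printf("mul\n"); break;
    }
}

int main(void) {
    /* AST for: a + b * 2 */
    Node two = { NODE_NUM, 2, NULL, NULL, NULL };
    Node b   = { NODE_VAR, 0, "b",  NULL, NULL };
    Node a   = { NODE_VAR, 0, "a",  NULL, NULL };
    Node mul = { NODE_MUL, 0, NULL, &b, &two };
    Node add = { NODE_ADD, 0, NULL, &a, &mul };
    emit(&add);   /* prints: load a / load b / push 2 / mul / add */
    return 0;
}
```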
The compiler back-end is responsible for compiling, analyzing, and optimizing the intermediate representation and then generating assembly or object code. The pre-link phase is a necessary step of inter-procedural analysis: it uses the same linking rules and order as the final link phase to determine the targets of global functions and variables, and builds a function call graph on that basis. Compilation analysis mainly includes control flow analysis, data flow analysis, context analysis, and alias analysis; these analyses provide the basis and strategies for subsequent optimization and code generation. Machine-independent optimization includes inter-procedural optimization, loop optimization, and scalar optimization; its goals and methods are independent of any specific processor architecture. Machine-specific optimization includes instruction selection and scheduling, register allocation, and optimization for specific hardware structures. In the code generation stage, register allocation on the intermediate representation is completed and the target code is generated through a built-in or external assembler. Finally, the object code files are linked by the linker to produce an executable file or loadable module.
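For a feel of what instruction selection and register allocation produce, here is a minimal sketch; the assembly shown is one plausible x86-64 lowering at a high optimization level, not output quoted from any particular compiler:

```c
/* Source function. */
int madd(int a, int b, int c) {
    return a * b + c;
}

/*
 * One plausible x86-64 (System V ABI) lowering:
 *
 *   madd:
 *       imul  edi, esi        ; a * b   (a arrives in edi, b in esi)
 *       lea   eax, [rdi+rdx]  ; + c, result lands in the return register
 *       ret
 *
 * Instruction selection picked imul and lea; register allocation kept
 * every value in the argument registers, so no stack traffic is needed.
 */
```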
Amongst the various parts of the compiler, compilation analysis is becoming more and more important. It is not only essential for compilation optimization; it is also the most critical step for emerging compiler application scenarios such as static code scanning. Intra-procedural control flow and data flow analysis techniques have matured and are widely used in compiler implementations, but alias analysis and context analysis still face huge challenges. From this perspective, alias analysis and context analysis are the most important parts of the compiler.
Alias analysis determines whether the memory or objects pointed to by two pointers or references completely overlap, partially overlap, or are independent of each other. Partial overlap can occur in languages such as C/C++ that allow union data types or pointer arithmetic. One difficulty of alias analysis is tracking the definition of pointer variables along the use-def chain through every calling context and every possible execution path of a function; the algorithmic complexity is exponential. Another is analyzing aliases of pointers into randomly accessible or recursive data structures such as arrays, linked lists, trees, and graphs, for which there are almost no precise results. Alias analysis is very important for both optimization and static code scanning.
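Let us use the C language as an example. The following sketch is illustrative (the specific statements are assumptions, not quoted code); a, b, c, and d are integer variables, and p and q are pointers to integer variables:

```c
/* a, b, c, d are integer variables; p and q point to integers
 * and are assumed to be initialized elsewhere. */
int a, b, c, d;
int *p, *q;

void example(void) {
    a = 1;
    *p = 2;     /* may modify a if p aliases a                      */
    b = a;      /* the compiler may propagate b = 1 only if alias   */
                /* analysis proves p does not point to a            */
    *q = 3;     /* may modify b if q aliases b                      */
    c = b;      /* likewise, c = 1 only if q does not point to b    */
    d = a + b;  /* folds to d = 2 only with both alias facts proved */
}
```

The same question drives static scanning: a checker that cannot decide whether *p = 2 writes to a must either report false positives or miss real defects.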
Context analysis places a function in the context of its call site, determines the values of parameters and global variables before the call, and analyzes the function’s behavior to determine the return value and the changes to global variables after the call ends. The difficulty is that a complete analysis is also exponential in complexity. For example, suppose function A calls function B at 3 different call sites, and function B calls function C at 3 different call sites: there are then 9 different contexts for function C. In addition, as software scale increases and modular design emphasizes code reuse, the numbers of functions and call sites grow significantly, and the number of contexts may expand beyond what can be analyzed. Lastly, recursive function calls make precise contextual analysis difficult. Context analysis is likewise important for optimization and static scanning.
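Here is a minimal sketch (the functions and values are illustrative) of how the calling contexts multiply:

```c
/* C behaves differently depending on the value flowing in from its caller. */
int C(int x) { return 100 / x; }        /* divides by its argument */

int B(int y) {
    return C(y) + C(y - 1) + C(y - 2);  /* 3 call sites for C */
}

int A(void) {
    return B(10) + B(20) + B(30);       /* 3 call sites for B */
}
```

Each of the 3 calls in A gives B a different incoming value, and each of the 3 calls in B passes C a different argument, so a context-sensitive analysis must reason about 9 combinations to prove, for instance, that C never divides by zero. Here all 9 arguments are nonzero, but a call such as B(2) elsewhere in the program would make one context unsafe while the other eight remain fine.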
05 Other than compiler technology generating code for execution or VM, where else can it be used?
In addition to the above fields, compiler technology can currently be applied in:
- Static scanning tools, where compilation analysis techniques are used to find bugs and code that does not conform to the required specifications.
- Code reformatting and refactoring tools based on the abstract syntax tree, which improve program readability and maintainability.
- Program analysis integrated into the IDE, which can provide programmers with intelligent hints and automatically fix ‘spelling’ errors in the code.
- Source-to-source compiler technology applied to processing various Domain Specific Languages (DSLs).
- Software composition analysis, which analyzes which third-party components a large-scale piece of software references and whether each of them has known security vulnerabilities, compatibility issues, or license issues. This helps avoid or fix such problems in the early stages of development.
06 Why is data flow analysis more effective in discovering program problems including errors and security vulnerabilities?
Take the widely used AST-based inspection as a comparison. AST-based inspection tools typically check the input source language with lexical and pattern-matching rules. For example, for the statement int x = y / 0;, an AST checker can detect that the statement divides by zero. However, for int x = y / z;, if z is assigned 0 somewhere in the program and that assignment can reach the division, program execution will also produce a division-by-zero error. The AST checker cannot detect this latter error because it is difficult for it to track the definition and use of variables and the propagation of values.
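A minimal sketch (the code is illustrative) of the case an AST checker misses:

```c
int scale(int y, int n) {
    int z = 1;
    if (n > 10)
        z = 0;          /* this assignment of 0 ...                    */
    return y / z;       /* ... can reach the division when n > 10      */
}
```

No single AST node here contains a literal division by zero; only data flow analysis, which tracks the value 0 from the assignment along the path where n > 10 to the use of z, can report the defect.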
Similarly, when pointers are passed across function boundaries, there is usually a convention about whether the caller or the callee checks for null pointers. When software integrates many modules developed by different organizations, or open-source modules, inconsistent calling conventions between modules cause either performance problems from repeated checks or program errors and security issues from missed checks. Since the relevant function call statements and the null pointer checks live in different abstract syntax trees, such checks are also difficult for an AST checker to implement. The combination of data flow analysis and context analysis can detect such errors.
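A minimal sketch (the module split and names are illustrative) of a missed check across a module boundary:

```c
#include <stddef.h>
#include <stdlib.h>

/* Module A: by convention, assumes the CALLER has checked for NULL. */
static void store(int *slot, int value) {
    *slot = value;                 /* no null check here by convention */
}

/* Module B: assumes the CALLEE checks, so it passes malloc's result
 * straight through, and malloc may return NULL. */
void record(int value) {
    int *slot = malloc(sizeof *slot);
    store(slot, value);            /* possible null dereference */
}
```

Each module is locally consistent with its own convention, and each AST looks fine in isolation; only inter-procedural data flow analysis, following the possibly-NULL value from malloc in record into the dereference in store, connects the two.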
Common errors or security vulnerabilities in programs can often be summarized with a ‘Source-Sink’ model: the data that causes the error or vulnerability comes from a problematic source, then goes through a series of flows, such as being copied to another variable, or written into memory and read back out, until it finally causes a program error at the sink. Take the common security vulnerability Use After Free (UAF) as an example. The source is the statement where the memory pointed to by a pointer is released; after a series of data flow operations, if the pointer or an alias of it reaches a statement where it is dereferenced (the sink), a use-after-free error is triggered. To detect this type of problem, a specific tag can be attached to a particular statement or variable (the source), propagated along the flow using compiler analysis techniques, and checked for presence or absence at statements or variables of interest (the sink). This type of analysis can detect such problems efficiently and accurately.
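A minimal sketch (the code is illustrative) of the source-sink pattern for UAF:

```c
#include <stdlib.h>
#include <string.h>

void demo(void) {
    char *buf = malloc(64);
    if (!buf) return;

    char *alias = buf;       /* alias analysis records alias == buf      */
    free(buf);               /* SOURCE: the memory both pointers refer   */
                             /* to is now tagged 'freed'                 */
    strcpy(alias, "boom");   /* SINK: a tagged pointer is dereferenced,  */
                             /* so a use-after-free is reported          */
}
```

The analysis tags the memory at the free call, uses the earlier assignment to link alias to the same tagged object, and reports the strcpy because a tagged pointer reaches a dereference.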
Major compiler techniques such as the control flow graph and data flow analysis were born in the 1970s, and most of the main analysis and optimization techniques were created in the decades leading up to the millennium. In the past two decades, compiler technology has taken a back seat in academia and industry, but some technical groups have continued to cultivate the field in the hope of positively impacting software development. For example, Xcalscan, a static code analysis tool, uses compiler technology to analyze source code for defects and vulnerabilities early in the software development cycle. This enables companies to improve development efficiency and reduce time and costs.
Author: Jianxin Lai, Head of R&D, graduated from the Department of Computer Science at Tsinghua University and has extensive experience in compiler optimization and advanced static analysis.