Crane lifting Scala onto Code Property Graph to conduct vulnerability analysis

Image Courtesy :

The Scala language has continued to gain popularity over the last several years, thanks to its excellent combination of functional and object-oriented software development principles, and its implementation on top of the proven Java Virtual Machine (JVM). Although Scala compiles to Java bytecode, it is designed to improve on many of the perceived shortcomings of the Java language. Offering full functional programming support, Scala’s core syntax contains many implicit structures that have to be built explicitly by Java programmers, some involving considerable complexity.

Scala fuses object-oriented and functional programming in a type-safe way. From the object-oriented world, Scala takes the concept of class, and follows the principle that “everything is an object”. From the functional world, it brings in algebraic data types, pattern matching, anonymous functions and closures. Staying true to the above principle, algebraic data types are encoded as classes (case classes), pattern matches as partial functions or extractor objects, functions as generic interfaces of one method and finally closures as objects implementing a function interface

As Scala gains popularity, the need grows for program analysis tools for it that automate tasks such as vulnerability analysis, verification and whole-program optimization. Besides the abstract syntax tree and control flow, such tools typically need call graphs to approximate the behavior of method calls.

Several Scala features such as traits, abstract type members, and closures leading to indirect function calls make call graphs constructions even harder. Furthermore, the Scala compiler translates certain language features using hard-to-analyze reflection. While solutions exist for analyzing programs that use reflection, such approaches tend to be computationally expensive or they make very conservative assumptions that result in a loss of precision.

The ability to grow the language through libraries is a key aspect of Scala. Its syntax allows users to write code that looks like built-in features, keeping the language small. For instance, the standard library provides a BigInt class that is indistinguishable from the standard Int type, and the for loop on integers is provided through a Range class.

This approach is elegant and gives programmers the power of language designers. However, everything comes at a price, and in this case the price is complexity in security analysis and performance penalties. Familiar looking code, like an assert statement or a for loop may conceal unexpected costs. While library designers are usually aware of these implications, users are often surprised by such performance hits

Let us illustrate few features of Scala that make it suitable for developing, further leading to complexity in security analysis.

Higher-order functions

Scala supports higher-order functions and has convenient syntax for function literals. For instance, method foreach is defined like this:

def foreach[U](f: A => U) = // ..

Here, foreach takes a function from type A (we assume foreach is defined inside a generic collection of elements of type A) to U (most of the times U is instantiated to Unit). Iterating over such a collection is then done like this

xs foreach { x => 
// x is bound successively to each element in xs

Notice how infix notation makes foreach look like a built-in feature of the language. The ability to extend the language with new control structures demands a way to delay evaluation of terms. To this end, Scala provides call-by-name parameters, which allow programs to pass unevaluated arguments to a method. Such arguments are evaluated each time their value is needed.

Implicit parameters and views

When designing a library it often happens that an existing type has to be augmented with new methods. For instance, a DSL may want to reuse the primitive values of the host language, but specific functionality is (naturally) missing on those types. To enable after-the-fact extension of existing types, Scala proposes a mechanism based on implicit values and views.

A parameter marked implicit can be filled automatically by the compiler when the programmer does not provide an explicit value. All values in scope that are marked implicit are eligible.

implicit def imFn(str: String): Parser[String] = accept(str) // ..

For comprehensions

Scala provides an extensible way to iterate over collections by means of for comprehensions. A for expression is translated to a sequence of method calls to foreach, map and withFilter. In its most general form a for-comprehension contains any number of generators and filters. For instance,

for (i <- xs; 
j <- ys;
if (i % j == 0)) print (i, j)

prints i and j only when i is a multiple of j. The first two statements are generators, binding i and j to each element of xs and ys respectively. This is achieved by translating the given comprehension into a series of method calls:

i => ys.
withFilter(j => i % j == 0).
foreach(j => print(i, j)

Any type that has these methods can be used as a generator inside a for comprehension. Even more, Scala does not provide a for loop as in most imperative programming languages, instead it has a Range class in the standard library with the required methods. However, to the programmer it looks like the language has built-in support for iterating over integers.

for (i <- 1 to 10) print(i)


Traits are one of the cornerstone features of Scala. They provide a flexible mechanism for distributing the functionality of an object over multiple reusable components. Traits are similar to Java’s abstract classes in the sense that they may provide definitions of methods and in that they cannot be instantiated by themselves. However, they resemble Java interfaces in the sense that a class or trait may extend (i.e., “mix-in”) multiple super-traits.

object SuperTrait {
trait A {
def callMe = println("A.CallMe");
def implMe;
trait B {
def callMe;
def implMe = this.callMe;
trait C {
def callMe = println("C.CallMe");
def implMe;
def main(args : Array[String]) = {
(new A with B).implMe

The code snippet above shows an example program that declares a trait A in which a concrete method callMe and an abstract method implMe are defined. The program also declares a trait B that defines a concrete method implMe and an abstract method callMe. Lastly, trait C defines a concrete method callMe.

The program contains a main method that creates an object by composing A and B, and then calls implMe on that object. The allocation expression new A with B is equivalent to a declaration and instantiation of a new empty class with parents A with B.

Abstract Type Members and Closures

Scala supports a flexible mechanism for declaring abstract type members in traits and classes. A type declaration defines a name for an abstract type, along with upper and lower bounds that impose constraints on the concrete types that it could be bound to. An abstract type is bound to a concrete type when its declaring trait is composed with (or extended by) another trait that provides a concrete definition in one of two ways: either it contains a class or trait with the same name as the abstract type, or it declares a type alias that explicitly binds the abstract type to some specified concrete type.

Scala allows functions to be bound to variables and passed as arguments to other functions. Figure 3 illustrates this feature, commonly known as “closures.”

object Closures {
def fn1(y : () => A) = y();
def fn2(z : () => B) = z();
 class A;
class B;

def main(args : Array[String]) {
val c1 = () => { new A };
val c2 = () => { new B };

Challenges with Call Graph Construction

The preceding features show that Scala’s traits and abstract type members pose new challenges for call graph construction. Several other Scala features, such as path dependent types, type based generic programming, structural types, introduce further complications.

The presence of traits complicates the construction of call graphs because method calls that occur in a trait typically cannot be resolved by consulting the class hierarchy alone.

To overcome these challenges, the call graph should take the following factors into consideration

  1. Needs to make certain assumptions about how traits are combined. Then, for each of these combinations, one could compute the members contained in the resulting type and approximate the behavior of calls by determining the method that is selected in each case
  2. Rather than performing the analysis at the source level, it is recommended apply it after the Scala compiler has desugared the code by transforming closures into anonymous classes that extend the appropriate scala.runtime.AbstractFunctionN. Each such class has an apply() method containing the closure’s original code.
  3. If Scala source code is transformed to JVM bytecode post compilation and during packaging, it would significantly transform code that results in the loss of type information, causing the computed call graphs to become imprecise. Additionally, the Scala compiler generates code containing hard-to-analyze reflection for certain Scala idioms.

Mounting Scala language semantics onto Code Property Graph

The Scala compiler is organized in a sequence of phases, each one translating the input language into a simpler form, until the program is close enough to Java to make code generation simple. The front-end uses an abstract syntax tree (AST) that is passed between phases, while the backend uses a stack based IR intermediate representation. Most of the transformations are done on the AST, and many of the interesting ones, like lambda lifting (constructing environments for free variables in lambda terms) and mixin (mixin composition, a form of multiple inheritance based on traits) are performed after type erasure.

After constructing the AST and control flow, Data Flow Analysis (DFA) is conducted to track the type of local variables and stack elements at every point in a method. We use a classical forward data flow analysis, formulated in terms of a type lattice. We begin by defining the type lattice, which is composed of the 9 primitive types, user defined types and the class hierarchy, having the usual subtyping semantics.

Special care has to be taken for control-flow paths involving exceptions. There may be control-flow merge points at the beginning of an exception handler (for instance, when different basic blocks are covered by the same exception handler). The least upper bound in that case has to be the special exception handler stack, containing exactly one element, of the type of the exception being caught. This is a direct consequence of the semantics of Java exception handlers: when an exception handler is invoked, it has exactly one value on the stack (the exception that was thrown).

The code property graph is a concept based on a simple observation: there are many different graph representations of code, and patterns in code can often be expressed as patterns in these graphs. While these graph representations all represent the same code, some properties may be easier to express in one representation over another. So, why not merge representations to gain their joint power, and, while we are at it, express the resulting representation as a property graph, the native storage format of graph databases, enabling us to express patterns via graph-database queries.

Illustration of a code property graph from the original paper “Modeling and Discovering Vulnerabilities with Code Property Graphs”, where an abstract syntax tree, control-flow graph and program-dependence graph are merged to obtain a representation for querying code

This original idea was published by Nico Golde, Daniel Arp, Konrad Rieck, and Fabian Yamaguchi at Security and Privacy in 2014, and extended for inter-procedural analysis one year later. The definition of the code property graph is very liberal, asking only for certain structures to be merged, while leaving the graph schema open. It is a presentation of a concept, data structure, along with basic algorithms for querying to uncover flaws in programs. It is assumed that somehow someone creates this graph for the target programming language. In consequence, concrete implementations of code property graphs differ substantially.

The code property graph enables us to efficiently switch between intra- and inter-procedural data flow analysis which gives precise context sensitive results. Those results improve the call graph which in turn improves the data flow tracking. Furthermore the code property graph enables us to create an adhoc approximated type system which greatly reduces the amount of individual call site resolutions and is thus key to analyze Scala complaint bytecode in a reasonable amount of time.

ShiftLeft is an application security platform built over the foundational Code Property Graph that is uniquely positioned to deliver a specification model to query for vulnerable conditions, business logic flaws and insider attacks that might exist in your application’s codebase.

If you’d like to learn more about ShiftLeft, please request a demo.

Crane lifting Scala onto Code Property Graph to conduct vulnerability analysis was originally published in ShiftLeft Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

*** This is a Security Bloggers Network syndicated blog from ShiftLeft Blog – Medium authored by Chetan Conikee. Read the original post at:—-86a4f941c7da—4