Optimization and Reordering of Loop Structures for Efficient Parallel Processing

The problem of choosing an optimal sequence of transformations that leads to the most efficient parallel version of a program remains an open question. Because of this, current compilers manage to incorporate only a set of heuristic decisions. This article addresses program transformation, analyzing the range of loop-structure transformations that we consider most appropriate today, and illustrates these transformations in the case of groups of companies.


INTRODUCTION
With a rather complicated organizational structure, characterized by behavioral flexibility and a lack of bureaucracy in all sectors of industry, commerce and services, groups of companies adapt easily to changing economic and social conditions. With a view to implementing new market information and communication technologies, this article proposes a prototype for decomposing the existing operations of groups of companies using parallel computing.
During the development of compiler theory, several source-code transformations have been proposed to optimize the execution of programs. Most optimizations for the sequential case aim to reduce the number of instructions executed by the program, using transformations based on a quantitative analysis of the values conveyed in the program and on data-flow analysis. More recent optimizations for parallel execution maximize parallelism and data locality in memory, using transformations based on the characterization of arrays and on the results of data-dependency analysis. The stages a compiler must complete to perform optimizations are the following: a) selection of the part of the source program to be optimized and of the appropriate processing for a particular purpose; b) checking that the transformation preserves the semantics of the program; c) transformation of the program. Data-dependency analysis techniques are used for steps a) and b). The selection stage is the most difficult and the least treated topic in current compiler theory. Due to the high cost of a full analysis of the optimization possibilities, compilers typically restrict their range of action to the transformations their builders consider most efficient. On the other hand, there may be sequences of transformations that have the opposite effect: for example, an attempt to reduce the number of executed instructions may ultimately reduce performance because of improper use of the cache. As architectures become more complex, the number of optimization directions increases significantly, and deciding among the range of transformations becomes very complicated.

Operator Reduction
Operator reduction aims to replace an expression in a loop with an equivalent expression that uses a less expensive operator ([4], [6]). Even if, most of the time, operator reduction is accompanied by the introduction of an additional variable, a significant time saving is achieved in the processing of the loop; this justifies placing the transformation in the category of execution optimizations. The most common use of operator reduction is the reduction of expressions that contain induction variables ([3], [4], [6]). Table 1 presents several possibilities for the reduction of operators. It is assumed that the operation in the first column occurs in a loop with index i from 1 to n; at processing time the compiler initializes a temporary variable T with the expression in the second column. The operation inside the loop is replaced by the expression in the third column, and the value of T is updated every iteration with the value in the fourth column. The positive effects are obvious.

Table 1. Reduction of operators

Expression    Initialization    Use       Update per iteration
x / c         T = 1/c           x * T     —
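As a concrete illustration of the row of Table 1, here is a minimal C sketch (function and variable names are ours, purely for illustration) of two forms of operator reduction: replacing a division by a loop-invariant constant with a multiplication, and replacing a multiplication by the loop index with an additive update of a temporary.

#include <stddef.h>

/* Original form: one division per iteration. */
void scale_orig(const double *x, double *y, size_t n, double c) {
    for (size_t i = 0; i < n; i++)
        y[i] = x[i] / c;
}

/* Reduced form: T = 1/c is computed once and the division
   becomes a cheaper multiplication, as in Table 1. */
void scale_reduced(const double *x, double *y, size_t n, double c) {
    double T = 1.0 / c;               /* temporary introduced by the transformation */
    for (size_t i = 0; i < n; i++)
        y[i] = x[i] * T;
}

/* Induction-expression reduction: the product i * c is kept in a
   temporary t that is updated additively every iteration. */
void index_reduced(double *a, size_t n, size_t c) {
    size_t t = 0;                     /* t holds 0 * c initially */
    for (size_t i = 0; i < n; i++) {
        a[t] = 0.0;                   /* instead of a[i * c] */
        t += c;                       /* update: t = t + c */
    }
}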

The Elimination of Induction Variables
A variable whose value is derived from the number of iterations executed by a loop is called an induction variable. The control variable of a "for" loop is the most common type of induction variable, but other variables can have this property. The example below illustrates this for the induction variable j. Induction-variable elimination simplifies the analysis of index expressions in the data-dependency tests: after removing the variable j, the analysis is based only on the values of the loop variable i and the constant n.
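Since the original code did not survive extraction, the following C sketch (hypothetical names) reconstructs the typical situation: j is advanced alongside i, and elimination rewrites every use of j as a closed-form expression in i.

/* Before: j is an induction variable advanced with i. */
void pack_orig(double *a, const double *b, int n) {
    int j = 0;
    for (int i = 0; i < n; i++) {
        j = j + 2;
        a[j] = b[i];          /* the i-j relation is hidden from dependency tests */
    }
}

/* After: j is eliminated; every use becomes 2*(i+1), so the
   analysis sees only the loop variable i and the constant n. */
void pack_transformed(double *a, const double *b, int n) {
    for (int i = 0; i < n; i++)
        a[2 * (i + 1)] = b[i];
}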

Factorization of Loop Invariants
When an operation occurs inside a loop but its result does not change between iterations (a loop invariant), the compiler can move that computation outside the loop ([3]). We give below a code sequence in which an expensive call to a transcendental function is moved outside the loop. The test that appears in the transformed code ensures that if the loop is not executed, the hoisted code is not executed either, preventing it from triggering an exception.
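A minimal C sketch of this factorization, assuming an illustrative loop that repeatedly evaluates sin(x) with a loop-invariant argument:

#include <math.h>

/* Before: the transcendental call is re-evaluated every
   iteration although its argument never changes. */
void fill_orig(double *a, int n, double x) {
    for (int i = 0; i < n; i++)
        a[i] = sin(x) * i;
}

/* After: the call is hoisted.  The guard reproduces the
   zero-trip behaviour: if the loop would not execute, the
   hoisted code does not execute either, so evaluating sin(x)
   cannot raise a spurious exception. */
void fill_hoisted(double *a, int n, double x) {
    if (n > 0) {
        double t = sin(x);    /* computed once */
        for (int i = 0; i < n; i++)
            a[i] = t * i;
    }
}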

Externalization of Conditional Instructions
This method is applied to loops that contain a conditional instruction whose test is invariant in the loop. The loop is then replicated on each branch of the conditional instruction, thus avoiding the disadvantage of conditional branching within the loop, reducing the size of the code representing the loop body and making it possible to parallelize a branch of the conditional instruction, as Allen remarks [5].
Conditional instructions that are candidates for externalization can be detected while analyzing the possibilities for factoring, a process that identifies loop invariants.
In the following example the variable X is loop invariant, allowing the loop to be externalized and the true branch to be executed in parallel, as shown in the transformed code. Notice that, as with the factorization of loop invariants, if evaluating the condition might trigger an exception, this must be prevented by a test of the possibility of execution. In a doubly nested loop in which the inner loop has unknown bounds, if the code is generated directly, there will be a test before the body of the inner loop to determine whether or not it will be executed; this test is repeated each time the outer loop executes. When the compiler uses an intermediate representation of the program, the test is explicit, and externalization can be used to move it outside the outer loop [27].
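A C sketch of the externalization of a loop-invariant conditional (hypothetical names; X is modeled as a function parameter), with the guard test that prevents evaluating the condition when the loop would not run:

/* Before: the test on the invariant X is evaluated every iteration. */
void update_orig(double *a, const double *b, int n, double X) {
    for (int i = 0; i < n; i++) {
        if (X > 0.0)
            a[i] = b[i] * X;
        else
            a[i] = 0.0;
    }
}

/* After: the loop is replicated on each branch; the guard n > 0
   plays the role of the execution test mentioned above.  Each
   replica is branch-free, and the true branch can be parallelized. */
void update_externalized(double *a, const double *b, int n, double X) {
    if (n > 0) {
        if (X > 0.0) {
            for (int i = 0; i < n; i++)
                a[i] = b[i] * X;
        } else {
            for (int i = 0; i < n; i++)
                a[i] = 0.0;
        }
    }
}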

Iteration Reordering Transformations
In this section we describe transformations that alter the relative execution order of the iterations of nested loops. These transformations are mainly used to expose opportunities for parallelization and for data locality in memory. Some compilers use reordering transformations only for perfectly nested loops. To increase the opportunities for optimization, compilers can sometimes use fission to extract perfectly nested loops from imperfect nestings. The compiler determines whether a loop can be executed in parallel by examining the dependencies induced by the loop iterations. If all of the loop's dependency distances are 0, then no dependency is carried across the iterations of that loop, and it may be parallelized.
We give below an example where the loop distance vector is (0,1), so the outer loop may be parallelized (figure 2).
More generally, the p-th loop of a nested structure of loops may be parallelized if, for every distance vector V = (v1, …, vp, …, vd), either vp = 0 or ∃q < p : vq > 0. In case (b) the distance vectors are {(1,0), (1,-1)}, so the inner loop may be parallelized: both references on the right-hand side of the expression read items of row i-1 of the array a, elements updated in the previous iteration of the outer loop, so the items of row i may be calculated and stored in any order.
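Both situations can be sketched in C as follows (hypothetical arrays; the OpenMP pragma simply marks the loop that the distance vectors allow to run in parallel):

#define N 1024
double a[N][N], b[N][N];

void distance_examples(void) {
    /* (a) distance vector (0,1): the dependency is carried by the
       inner j loop, so the outer i loop may be parallelized. */
    #pragma omp parallel for
    for (int i = 1; i < N; i++)
        for (int j = 1; j < N; j++)
            b[i][j] = b[i][j - 1] + 1.0;

    /* (b) distance vectors {(1,0), (1,-1)}: both reads access row
       i-1, written by the previous iteration of the outer loop,
       so the inner j loop may be parallelized. */
    for (int i = 1; i < N; i++) {
        #pragma omp parallel for
        for (int j = 1; j < N - 1; j++)
            a[i][j] = a[i - 1][j] + a[i - 1][j + 1];
    }
}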

Loop Interchange
This transformation changes the position of two loops in a PNL (Perfectly Nested Loop), usually moving one of the outer loops to the innermost position ([7], [34]). Interchange is considered one of the most powerful transformations and can improve performance in many ways. It is used mainly to: • enable vectorization, by exchanging an inner loop that carries dependencies with an independent outer loop; • improve vectorization, by shifting the largest independent loop to the innermost position; • improve the performance of parallel execution, by moving an independent loop to the outermost position of the nest, thus increasing the granularity and reducing the number of barrier synchronizations required; • reduce the access stride, preferably to 1; • increase the number of loop-invariant expressions in the innermost loop.
We should note that these benefits may conflict with one another. For example, an interchange that improves the degree of register reuse can turn an access pattern with stride 1 into an access pattern with stride n, which may have much lower overall performance due to a much larger number of cache misses. In the following example, the inner loop accesses the array a with stride n; we use the convention of storing array elements by columns. By interchanging the loops, we convert the inner loop into one whose access stride is 1. For a large array that does not fit in the cache, the optimization reduces the number of cache misses for a from n² to n² · de / dl, where de is the size of an element and dl the size of a cache line.
However, the original loop permits total[i] to be placed in a register, eliminating the load/store operations from the inner loop.
Thus the interchanged version increases the number of load/store operations for total from 2n to 2n². If the array a fits in the cache, the original loop proves more advantageous. For a vector architecture, on the other hand, the transformed loop allows vectorization by eliminating the dependency on total[i] in the inner loop.
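A C sketch of this trade-off, adapted to C's row-major layout so that the stride-n and stride-1 access patterns appear as described (array and function names are illustrative):

#define N 2048
double a[N][N], total[N];

/* Original nesting: the inner loop walks a with stride N doubles,
   but total[i] is invariant in the inner loop and can live in a
   register. */
void sum_orig(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            total[i] += a[j][i];
}

/* Interchanged nesting: the inner loop walks a with stride 1,
   but total[i] must now be loaded and stored every iteration. */
void sum_interchanged(void) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            total[i] += a[j][i];
}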
Loop interchange is legal when the interchanged dependencies remain legal and the loop bounds can be interchanged. If two loops, p and q, of a d-loop PNL are interchanged, then in every dependency distance vector the elements vp and vq are swapped; the interchange is legal if every resulting vector is lexicographically positive. Interchanging the loop bounds is a simple operation when the iteration space is rectangular, as in the previous PNL example.
In this case the loop bounds are independent of the indices of the enclosing loops, and the two can simply be interchanged. When the iteration space is not rectangular, the calculation of the new loop bounds becomes more complex.
Triangular and even trapezoidal iteration spaces are often used in programming, and imperfectly nested loops occur frequently; their management requires more complex techniques. Some of these issues are addressed in detail in ([36], [39]).

The Inner Loop Translation
Inner loop translation (loop skewing) is a transformation especially useful in combination with loop interchange ([22], [26], [39]). Translation was introduced to handle the so-called wave-front computations, named this way because they update array elements like a wave propagating through the iteration space. Translation is performed by adding the outer loop index, multiplied by a translation factor f, to the bounds of the inner loop, and then subtracting the same value from each use of the inner iteration variable. Since the loop bounds change accordingly and the uses of the index variable compensate, translation does not change the semantics of the program and is always legal.
The loop nest in figure 5 (a) may be subject to interchange, but it cannot be parallelized, because of a dependency on both the inner loop, (0,1), and the outer one, (1,0). In the graph this is expressed by the existence of horizontal edges ((0,1)) and vertical ones ((1,0)).
The result of translation by f = 1 is shown in figure 5 (c-d). The transformed code is equivalent to the original, but its effect on the iteration space is to align the "wave crests" (the diagonals) of the original nest, so that the right-to-left diagonals become vertical lines; for a given value of j, all iterations over i can be executed in parallel (since there are no vertical dependency arcs, the iterations for a fixed j do not depend on one another).
To exploit this parallelism, the loop structure must also be interchanged after translation. After translation and interchange, the nest has the distance vectors {(1,0), (1,1)}. The first dependency does not prevent parallelizing the inner loop, because the corresponding inner dependency distance is 0; the second does not either, because it is a dependency on previous iterations of the outer loop.
Translation can expose parallelism for a nest of two loops with the set of distance vectors {Vk} as follows: when we translate with factor f, the original distance vector (v1, v2) becomes (v1, f·v1 + v2). For any dependency with v2 ≤ 0, the goal is to find f such that f·v1 + v2 ≥ 1. The correct translation factor f is calculated by taking the maximum of fi = ⌈(1 - vi2) / vi1⌉ over all the dependencies (Kennedy 1993). The interchange of translated loops is complicated by the fact that their bounds depend on the loop iteration variables.
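A C sketch of translation with f = 1 applied to a wave-front computation of the kind discussed above (hypothetical array; the diagonal index k plays the role of the interchanged outer loop):

#define N 512
double w[N][N];

/* Original nest: dependencies (1,0) and (0,1) forbid
   parallelizing either loop. */
void wavefront_orig(void) {
    for (int i = 1; i < N; i++)
        for (int j = 1; j < N; j++)
            w[i][j] = w[i - 1][j] + w[i][j - 1];
}

/* Translated by f = 1 and interchanged: k = i + j scans the
   diagonals serially, and every use of j is rewritten as k - i.
   For a fixed k, all iterations over i read only values produced
   on diagonal k - 1, so they are mutually independent. */
void wavefront_skewed(void) {
    for (int k = 2; k < 2 * N - 1; k++) {
        int lo = (k - (N - 1) > 1) ? k - (N - 1) : 1;
        int hi = (k - 1 < N - 1) ? k - 1 : N - 1;
        #pragma omp parallel for
        for (int i = lo; i <= hi; i++)
            w[i][k - i] = w[i - 1][k - i] + w[i][k - i - 1];
    }
}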
An alternative method for treating wave-front computations is super-node partitioning.

Loop Reversal
This transformation reverses the direction in which the loop traverses its iteration space. It is often used in combination with other iteration-space reordering transformations, because it changes the dependency vectors. As an independent optimization, loop reversal can reduce loop overhead by eliminating the need for a comparison instruction on architectures without a compare-and-branch instruction, such as the Alpha [32].
The loop is reversed so that the iteration variable decreases to zero, allowing the loop to end with an instruction of type BNEZ (branch if not equal to zero). If loop p of a nest of d loops is reversed, then in each dependency vector V the element vp is negated. The reversal is legal if each resulting vector V' is lexicographically positive, i.e. if vp = 0 or ∃q < p : vq > 0.
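A minimal C sketch of reversal (illustrative names); the legality condition is trivially met here because the loop carries no dependency:

/* Before: the backward branch must compare i against n each trip. */
void clear_orig(double *a, int n) {
    for (int i = 0; i < n; i++)
        a[i] = 0.0;
}

/* After: i counts down to zero, so at machine level the loop can
   close with a single BNEZ-style test against zero. */
void clear_reversed(double *a, int n) {
    for (int i = n - 1; i >= 0; i--)
        a[i] = 0.0;
}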

Changing the Cycle Granularity
Changing the granularity of a loop (strip mining) is a method for adjusting the granularity of an operation, especially a parallelized one ([2], [7], [24]). The original definition of this operation involves transforming a one-dimensional loop into a two-dimensional one. A dependency of distance (d) becomes (0, d) and (1, d - S - 1), where S is the strip size. The transformation is always legal, in the sense that it does not induce negative dependencies in the transformed loops, but it pays off only if S ≥ d; otherwise it has no positive effect. Changing the granularity is usually performed for execution on vector machines, to exploit the size of the machine's vector registers efficiently. We present an example below. The computation with changed granularity is expressed in matrix notation and is equivalent to a forall loop. If the iteration count is not divisible by the strip size, additional changes are needed; for this purpose we use so-called cleanup code [7], as in the last loop of figure 7 (b).
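A C sketch of changing the granularity of a one-dimensional loop with strip size S, including the cleanup code for a trip count that is not a multiple of S (names are illustrative):

#define S 64   /* strip size, e.g. the vector register length */

void saxpy_strip(double *y, const double *x, double c, int n) {
    int i, is;
    /* The outer loop walks the strips; each inner loop is a
       length-S candidate for vector execution. */
    for (is = 0; is + S <= n; is += S)
        for (i = is; i < is + S; i++)
            y[i] = y[i] + c * x[i];
    /* Cleanup code: the final partial strip. */
    for (i = is; i < n; i++)
        y[i] = y[i] + c * x[i];
}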
One of the most common uses of granularity change is choosing the number of independent computations in the inner loop of a nest. On a vector machine, for example, a serial loop can be converted into a series of operations on arrays, each array being one granularity unit long.
Changing the granularity is also used in some SIMD compilations [34], to combine send operations in loops on distributed-memory multiprocessors [13], and to limit the size of the temporary arrays generated by the compiler ([1], [39]).
Changing the granularity often requires other transformations. Loop decomposition may reveal simple loops nested within a loop that is too complex to undergo the granularity change itself. Loop interchange can be used to move a parallelized loop to the innermost position of the nest, to maximize the size of the granularity unit.
The above examples demonstrate that changing the granularity can create a bigger processing unit from smaller ones.
The transformation can also be used in the opposite direction, reducing the initial granularity, when this is necessary for execution efficiency.

Loop Shrinking
The contraction of a loop (cycle shrinking) is a special case of granularity change. When a loop displays dependencies that prevent it from being executed in parallel (i.e., from being converted into a forall statement), the compiler can still detect a certain degree of parallelism, provided that the dependency distance is greater than 1.
In this case, the contraction converts a serial loop into an outer serial loop and an inner parallel loop [28]. Loop contraction is used especially to expose fine-grained parallelism.
For example, in figure 8 (a), a[i + k] is updated in iteration i and read in iteration i + k, giving a dependency distance of k. As a result, the first k iterations can be executed in parallel, on condition that none of the following iterations begins execution until the first k have finished. The same is then done with the next k iterations, as shown in figure 8 (b). The iteration-space dependencies are shown in figure 8 (c): each group of k iterations depends only on the previous group.

Fig. 8. Iteration-space dependencies
The result is, potentially, a speedup by a factor of k, but this value of k is usually small (2 or 3). So this optimization is typically limited to exposing parallelism that can be exploited at the instruction level, for example through pipelined execution. Note that the value of k must be constant within the loop, and the compiler must know at least that it is positive.
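A C sketch of loop contraction for the dependency distance k described above (hypothetical arrays; the caller must guarantee that a has at least n + k elements and that k is a positive constant):

void shrink(double *a, const double *b, int n, int k) {
    /* The outer loop is serial over groups of k iterations. */
    for (int g = 0; g < n; g += k) {
        int hi = (g + k < n) ? g + k : n;
        /* Within one group, writes go to a[g+k .. g+2k-1] while
           reads come from a[g .. g+k-1], produced by earlier
           groups, so the k iterations can run in parallel. */
        #pragma omp parallel for
        for (int i = g; i < hi; i++)
            a[i + k] = a[i] + b[i];
    }
}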

Dividing the Iteration Space
Dividing the iteration space (loop tiling) is a multidimensional generalization of the granularity-changing transformation. It is primarily used to improve cache reuse, by dividing the iteration space into so-called tiles and transforming the nested loops to iterate over them ([2], [12], [21], [39]). The transformation can also be used to improve data locality with respect to the CPU registers or memory pages.

Fig. 9. The division of iteration space
The need for this transformation is illustrated by the loop in figure 9 (a), which assigns to a the transpose of b. With the j loop innermost, the access to b is made with stride 1, while the access to a has stride n.
Interchange does not help, because it would then access b with stride n. By iterating over tiles of the iteration space, as shown in figure 9 (b), the loop uses every cache line completely. The two inner loops of matrix multiplication have a similar structure, tiling being necessary to obtain runtime efficiency in dense matrix multiplication.
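A C sketch of tiling the transpose (hypothetical arrays; the tile size T is assumed small enough that a T x T tile of each array fits in the cache):

#define N 4096
#define T 32
double a[N][N], b[N][N];

void transpose_tiled(void) {
    for (int ii = 0; ii < N; ii += T)
        for (int jj = 0; jj < N; jj += T)
            /* Within one T x T tile both arrays stay cache-resident,
               so every fetched cache line is fully used before it is
               evicted, for both the stride-1 and the stride-N array. */
            for (int i = ii; i < ii + T && i < N; i++)
                for (int j = jj; j < jj + T && j < N; j++)
                    a[i][j] = b[j][i];
}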
A pair of adjacent loops can be tiled if they can be legally interchanged. After tiling, the outer pair of loops can be reordered to improve data locality at the tile level, and the inner loops can be interchanged to exploit parallelism and data locality at the register level.
Tiling can be expressed as a granularity increase from a single iteration to a collection of iterations (this collection actually being the tile), the outer loops having the task of scrolling through the tiles and the inner ones being responsible for completing the iterations within a tile correctly.

The Loop Decomposition
The decomposition (also called loop distribution, loop fission or splitting) divides a loop structure into several ones. Each new loop has the same iteration space as the original, but contains only a subset of its instructions ([20], [26]).
The decomposition is used to: • create perfectly nested loops; • create sub-loops with fewer dependencies; • improve instruction-cache behavior due to the smaller size of the loops; • reduce memory requirements, by iterating over fewer arrays; • increase the degree of register reuse. Figure 10 is an example in which decomposition removes dependencies and allows parts of a loop to be executed at the same time. The decomposition can be applied to any loop, but all the instructions belonging to a dependency cycle (called a π-block [20]) must be placed in the same loop, and if S1 precedes S2 in the original loop, the loop containing S1 must also precede the one containing S2. If the loop contains control flow, applying if-conversion can expose decomposition opportunities; an alternative is to use a control dependence graph [19].
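A C sketch of decomposition in the spirit of figure 10 (hypothetical arrays): the recurrence on a forms a dependency cycle and must stay serial, while the statement on b, once split off, can be parallelized:

void fission(double *a, double *b, const double *c, int n) {
    /* Original loop:
       for (int i = 1; i < n; i++) {
           a[i] = a[i - 1] + c[i];   // serial recurrence
           b[i] = b[i] * c[i];       // independent statement
       }
    */
    for (int i = 1; i < n; i++)       /* this loop stays serial */
        a[i] = a[i - 1] + c[i];
    #pragma omp parallel for          /* this loop is now parallel */
    for (int i = 1; i < n; i++)
        b[i] = b[i] * c[i];
}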
A specialized version of these transformations is the so-called decomposition by name, initially called horizontal decomposition of name partitions [2]. Instead of a comprehensive analysis of data dependencies, the loop instructions are partitioned into mutually exclusive sets according to the variables they access; instructions in different sets are thus guaranteed to be independent. When the arrays are large, decomposition by name may increase the degree of data locality in the cache memory. Note that the loop above cannot be decomposed using fission by name, because its statements access the same array a.

The Fusion of Loops
The reverse of decomposition is fusion, which can improve performance by: • reducing delays due to loop overhead; • increasing the level of instruction parallelism; • improving data locality at the level of registers, cache or memory pages [1]; • improving the load balance of parallel loops.
In figure 10, decomposition allows partial parallelization of the loop. Merging the two loops improves locality in registers and cache, because a[i] has to be loaded only once. The fusion also increases the degree of instruction-level parallelism, by increasing the ratio of floating-point to integer operations in the loop structure, and eliminates the loop overhead of the second loop. If n is large, the split loops would run faster on a vector machine, while the fused loop would run faster on a superscalar machine.
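A C sketch of such a fusion (hypothetical arrays; the original pair of loops is kept as a comment):

void fused(const double *a, const double *b, double *x, double *y, int n) {
    /* Before fusion:
       for (int i = 0; i < n; i++) x[i] = a[i] + b[i];
       for (int i = 0; i < n; i++) y[i] = a[i] * b[i];
    */
    for (int i = 0; i < n; i++) {
        x[i] = a[i] + b[i];           /* a[i] and b[i] are loaded once */
        y[i] = a[i] * b[i];           /* and reused by this statement  */
    }
}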
In order to merge two loops, they must have the same bounds. If the bounds are not identical, it is sometimes possible to make them match by adjusting them (a technique suggestively called loop peeling) or by introducing conditional expressions into the loop body.
Two loops with the same bounds can be merged if there are no two instructions, S1 in the first loop and S2 in the second, that would have a dependency S2 → S1 with direction < in the merged loop. The reason this would be incorrect is that, before the merge, all instances of S1 execute before any instance of S2; after the merge, the corresponding instances execute together. If there is an instance of S1 that, because of a dependency, must execute after an instance of S2, the merge changes the order of execution, as shown in figure 11.
Fig. 11. Two loops containing S1 and S2 cannot be merged if S2 → S1 in the loop structure obtained after the merge.

Case Study
Let us consider a group of companies G which has a mother-firm and n subsidiaries named S1, S2, …, Sn. Operator reduction is applied in the stage of aggregation of accounts: the mother-firm accumulates all the accounts as in figure 12 (a multiplication operation and its transformation into an addition).

Fig. 12. The reduction of operators in the mother-firm
In the same group we can eliminate an induction variable, as in the next figure: we start with two variables, i and j, we eliminate j, and we are finally left only with i. In the stage of consolidation of the accounts of the group of companies, we can thus reduce the number of variables by eliminating the loop variable j. The operations of the group companies are then reduced to the elimination of mutual accounts, mutual operations and reciprocal results, using only one loop variable.
In the stage of factorization of loop invariants, the incomes of the subsidiaries which are reflected in the mother-firm can be expressed as in figure 14. In an article written by Chung in 1990 [10], a formal mathematical framework is presented which unifies the existing loop transformations. This model gave us the idea of applying these transformations in accounting for groups of companies.
The article of Vivek describes a general framework for representing iteration-reordering transformations, a special class of program transformations that change the execution order of loop iterations. Fernandez et al., in their 1995 article [11], present a method for code transformation using non-unimodular transformations; we described a model synergetic with theirs. Jacobson et al., in their article, describe the dependency-analysis tests currently used to identify ways of transforming sequential C code into parallel C code. Quing writes: "To optimize complex loop structures both effectively and inexpensively, we present a novel loop transformation, dependence hoisting, for optimizing arbitrarily nested loops, and an efficient framework that applies the new techniques to aggressively optimize benchmarks for better locality". Jain et al. [17] tell us that, based on important theorems, algorithmic methods are developed for program transformation to improve cache performance. A remarkable article is that of Louis et al. [23], a representative work in which the authors bring together algebraic, algorithmic and performance-analysis results to design a tractable optimization algorithm over a highly expressive space.
Our work aims to address a new concept: integrating groups of companies with parallel computing. This can be done easily through the transformation of loop structures by optimization and reordering. We have not found this approach elsewhere in the literature.

Conclusions
The selection of transformations of loop structures, such as optimization and reordering, is a complex problem. In this article we have presented and analyzed the most important and most commonly used loop-level transformations. They are useful in the context of automatic parallelization, although it is interesting to note that some of them were originally introduced as optimizations of sequential execution. A future article will cover iteration-reordering transformations in more depth; they have proved particularly well suited to exposing the parallelism inherent in sequential programs. We have tried to apply these transformations to the economics and management of groups of firms, whose complex activity most often requires parallelization and business transformation for better organizational management.