Software systems and computational methods
Reference:

Ambiguous Results when Using Parallel Class Methods within the .NET Framework


Gibadullin Ruslan Farshatovich

ORCID: 0000-0001-9359-911X

PhD in Technical Sciences

Associate Professor of the Computer Systems Department of Kazan National Research Technical University named after A.N. Tupolev-KAI (KNRTU-KAI)

420015, Russia, Republic of Tatarstan, Kazan, Bolshaya Krasnaya str., 55, office 432

rfgibadullin@kai.ru
 

 
Viktorov Ivan Vladimirovich

Postgraduate Student of the Computer Systems Department of Kazan National Research Technical University named after A.N. Tupolev-KAI (KNITU-KAI)

420015, Russia, Republic of Tatarstan, Kazan, Bolshaya Krasnaya str., 55, office 432

victorov.i.vl@yandex.ru
 

 

DOI:

10.7256/2454-0714.2023.2.39801

EDN:

UGEGOO

Received:

17-02-2023


Published:

08-03-2023


Abstract: Parallel programming is a way of writing programs so that they can run concurrently on multiple processors or cores. This allows programs to process large amounts of data or perform complex calculations in far less time than would be possible on a single processor. Its advantages include increased performance, load sharing, the ability to process large volumes of data, improved responsiveness, and increased reliability. In general, parallel programming offers many benefits that help improve the performance and reliability of software systems, especially as computational tasks and data volumes grow. However, parallel programming also brings its own difficulties, related to synchronization management, data races, and other aspects that demand additional attention and experience from the programmer. When testing parallel programs, ambiguous results may be obtained: this can happen, for example, when we aggregate float- or double-type data by means of the For or ForEach methods of the Parallel class. Such behavior makes one doubt the thread safety of the written code; that conclusion can be incorrect and premature. The article reveals a possible reason for the ambiguity of the results produced by a parallel program and offers a concise solution to the problem.


Keywords:

Parallel programming, C# programming language, Multithreading, Rounding errors, Variability of results, Thread safety, Real numbers, Decimal type, .NET platform, Parallel class

This article is an automatic translation; the original Russian text is available on the journal's website.

Introduction

In scientific and engineering calculations, accuracy is an important criterion determining the reliability of the results.

When performing mathematical operations and calculations on a computer, numbers must be rounded, which can lead to errors. Reducing the calculation error associated with rounding is therefore an important task in scientific and engineering practice. Rounding occurs because the number of significant digits after the decimal point is limited, which inevitably leads to a loss of accuracy.

Depending on the task and the required accuracy, the correct way of rounding numbers must be chosen. For example, when rounding to an integer, the fractional part is lost entirely, and when rounding to a certain number of decimal places, precision is lost through the discarded digits. The calculation error can be reduced by using more accurate computational methods, such as Gaussian elimination or numerical integration methods.

In addition, various algorithms and computing techniques can be used to reduce the loss of accuracy when rounding numbers. Reducing the calculation error is especially important where high accuracy is required: when solving systems of linear equations [1], in numerical differentiation and integration [2], in modeling physical phenomena [3], in evaluating mathematical functions [4], in calculating interest rates or investment returns [5,6], and in machine learning and artificial intelligence algorithms [7].

In these cases, even small rounding errors can lead to serious mistakes and incorrect results, which can negatively affect the decisions based on them. Thus, reducing the calculation error associated with rounding is an important factor in scientific and engineering practice, since it yields more accurate results and reduces the likelihood of errors and incorrect conclusions.
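One widely used technique of this kind is compensated (Kahan) summation, which carries a running correction term for the low-order bits lost in each addition. The sketch below is our own illustration (the class and method names are hypothetical, not part of the article's code) and compares it with a naive loop:

```csharp
using System;

class KahanDemo
{
    // Kahan (compensated) summation: the variable c accumulates
    // the low-order bits that plain addition would discard.
    public static double KahanSum(double[] values)
    {
        double sum = 0.0, c = 0.0;
        foreach (double v in values)
        {
            double y = v - c;    // Apply the stored correction.
            double t = sum + y;  // Low-order bits of y may be lost here...
            c = (t - sum) - y;   // ...but are recovered into c.
            sum = t;
        }
        return sum;
    }

    static void Main()
    {
        // One million copies of 0.1 should sum to 100000.
        var values = new double[1_000_000];
        for (int i = 0; i < values.Length; i++) values[i] = 0.1;

        double naive = 0.0;
        foreach (double v in values) naive += v;

        Console.WriteLine(naive);            // Drifts away from 100000.
        Console.WriteLine(KahanSum(values)); // Much closer to 100000.
    }
}
```

The correction variable c costs three extra floating-point operations per addend, but keeps the accumulated error essentially independent of the number of terms.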

Parallel programming technologies are increasingly used in modern computing systems to increase performance.

However, parallel algorithms introduce additional difficulties [8,9] related to the need to exchange data between processes and to synchronize operations. Under parallel execution, rounding can introduce additional errors caused by inconsistent rounding across processes.

This can lead to errors in calculations, which matters especially where high accuracy is required. Reducing the rounding-related calculation error is therefore of practical importance in parallel programming.

To reduce errors in parallel algorithms, special algorithms and techniques must be applied. This is particularly relevant to the parallel aggregation of local values, which is the subject of this article. For demonstration purposes we take the task of summing square roots, which arises in various fields of science and technology where accurate calculations are needed. Calculating a sum of square roots is useful in many problems, such as:

  • Determining the distance between two points in n-dimensional space.
  • Calculation of the norm of a vector in n-dimensional space.
  • Calculation of the area of the curve on the plane.
  • Determination of the time it takes for a body to fall to the ground from a given height with acceleration of free fall.
  • Calculation of the cost of cargo delivery depending on the distance and weight of the cargo.
  • Determination of the time it takes for light to travel the distance between two points in space.
  • Estimation of the time required for testing and training in machine learning.
  • Evaluation of the speed of computer programs and algorithms.

 

Ambiguity of results during parallel aggregation of local values

The Parallel.For and Parallel.ForEach methods offer a set of overloaded versions that take a generic type argument named TLocal. These overloads are designed to help optimize the aggregation of data from iteration-intensive loops. Below is the overloaded version of the Parallel.For method that we will use in the subsequent analysis.

public static ParallelLoopResult For<TLocal>(
    int fromInclusive,
    int toExclusive,
    Func<TLocal> localInit,
    Func<int, ParallelLoopState, TLocal, TLocal> body,
    Action<TLocal> localFinally);

Where:

  • fromInclusive – the initial index, inclusive.
  • toExclusive – the final index, exclusive.
  • localInit – a function delegate that returns the initial state of the local data for each task.
  • body – a delegate that is invoked once per iteration.
  • localFinally – a delegate that performs the final action on the local result of each task.
  • TLocal – the type of the thread-local data.
  • The returned ParallelLoopResult structure contains information about the completed portion of the loop. 

Let's apply this method in practice to sum the square roots of the numbers from 1 to 10^7.

object locker = new object();
double grandTotal = 0;

Parallel.For(1, 10000000,
    () => 0.0,                     // Initialize the local value.
    (i, state, localTotal) =>      // Body delegate. Notice that it
        localTotal + Math.Sqrt(i), // returns the new local total.
    localTotal =>                  // Add the local total
        { lock (locker) grandTotal += localTotal; } // to the master value.
);

Console.WriteLine(grandTotal);

This solution may produce an ambiguous result, for example:

  • 21081849486.4431;
  • 21081849486.4428;
  • 21081849486.4429. 

Thus, across program runs the calculated results may differ in the third or fourth decimal place, which is unacceptable in tasks such as:

  • Scientific and engineering calculations related to the design and testing of new technologies and devices, where even small errors can lead to incorrect conclusions.
  • Development and analysis of financial models where the accuracy of the results can have a significant impact on investment decisions.
  • Working with large datasets in machine learning and data analysis, where the accuracy of the results may be important to obtain correct conclusions and forecasts.
  • Calculation of physical and chemical constants, such as Planck's constant or Boltzmann's constant, which are used in a wide range of scientific and engineering calculations.
  • Research in the field of astronomy and cosmology, where high accuracy of results can be important for determining the physical characteristics of stars, planets and galaxies. 
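Before examining the causes in detail, it helps to see the core effect in isolation: floating-point addition is not associative, so the grouping of addends (which in Parallel.For depends on the run-time partitioning) changes the rounded result. A small self-contained illustration (the values are chosen for demonstration and do not come from the article):

```csharp
using System;

class NonAssociativeDemo
{
    static void Main()
    {
        double big = 1e16;   // The spacing between doubles near 1e16 is 2,
        double small = 1.0;  // so adding a single 1.0 to it is lost.

        double leftFirst  = (big + small) + small; // each 1.0 vanishes separately
        double smallFirst = big + (small + small); // 2.0 is large enough to survive

        Console.WriteLine(leftFirst == smallFirst); // False
        Console.WriteLine(smallFirst - leftFirst);  // 2
    }
}
```

In a parallel sum the partitioning decides which addends are grouped together in each local total, so different partitionings yield slightly different final values.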

The reason for the ambiguity of the results is twofold. First, real numbers are subject to rounding errors. Second, the delegate responsible for forming the local totals executes piecewise on the pool threads. Let's consider both in more detail.

The float and double types internally represent numbers in binary form. For this reason, only numbers expressible in binary notation are represented exactly. In practice, this means that most literals with a fractional part (which are decimal) will not be represented exactly. For example:

float x = 0.1f;  // Not exactly 0.1
Console.WriteLine(x + x + x + x + x + x + x + x + x + x);  // 1.0000001

That is why the float and double types are unsuitable for financial calculations. In contrast, the decimal type works in base 10, so it can exactly represent fractional numbers such as 0.1 that are expressible in the decimal system (as well as in number systems whose base divides 10 – binary and base 5). Since real literals are written in decimal, the decimal type can represent numbers such as 0.1 exactly. However, neither double nor decimal can exactly represent a fraction with a periodic decimal expansion:

decimal m = 1M / 6M;    // 0.1666666666666666666666666667M
double  d = 1.0 / 6.0;  // 0.16666666666666666

This leads to accumulating rounding errors:

decimal notQuiteWholeM = m+m+m+m+m+m;  // 1.0000000000000000000000000002M
double  notQuiteWholeD = d+d+d+d+d+d;  // 0.99999999999999989

which in turn break equality and comparison operations:

Console.WriteLine(notQuiteWholeM == 1M);   // False
Console.WriteLine(notQuiteWholeD < 1.0);   // True

Table 1 below provides an overview of the differences between the double and decimal types.

Table 1. Differences between the double and decimal types [9]

Characteristic            double                                        decimal
Internal representation   Binary                                        Decimal
Decimal precision         15-16 significant digits                      28-29 significant digits
Special values            +0, -0, positive/negative infinity and NaN    None
Processing speed          Native to the processor                       Not native to the processor (about 10 times slower than double)

Let us examine the decimal type in more detail to answer the question of why the processor has no native support for decimal data. The binary representation of a decimal value consists of a 1-bit sign, a 96-bit integer, and a scaling factor used to divide the integer and specify its fractional part. The scaling factor is implicitly the number 10 raised to a power in the range from 0 to 28. The value is thus stored as an array bits of four 32-bit signed integers:

  • bits[0], bits[1] and bits[2] contain the low, middle and high 32 bits of the 96-bit integer.
  • bits[3]:
    • bits 0-15 are not used;
    • bits 16-23 (8 bits) contain an exponent from 0 to 28, which indicates the power of 10 by which the integer is divided;
    • bits 24-30 are not used;
    • bit 31 contains the sign (0 means a positive value, 1 a negative value). 
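This layout can be inspected directly with the decimal.GetBits method; a brief sketch (the expected values follow from the layout described above):

```csharp
using System;

class DecimalBitsDemo
{
    static void Main()
    {
        // 1.5m is stored as the 96-bit integer 15 scaled by 10^1.
        int[] bits = decimal.GetBits(1.5m);

        Console.WriteLine(bits[0]);                // 15 (low 32 bits of the integer)
        Console.WriteLine(bits[1]);                // 0  (middle 32 bits)
        Console.WriteLine(bits[2]);                // 0  (high 32 bits)
        Console.WriteLine((bits[3] >> 16) & 0xFF); // 1  (exponent: divide by 10^1)
        Console.WriteLine((bits[3] >> 31) & 1);    // 0  (sign bit: positive)
    }
}
```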

To calculate the sum of square roots using the decimal data type, some number representation parameters that may be significant include:

  • The number of digits in the fractional part – the more digits, the more accurate the representation of square roots will be, and the higher the accuracy of calculating the sum.
  • Rounding mode – when a result has more decimal places than can be stored, the rounding mode affects the accuracy of the sum; for example, rounding to the nearest even digit (banker's rounding) and rounding half away from zero can produce different results.
  • Memory size – the larger the memory size, the more characters can be stored, and the higher the accuracy of calculating the sum.
  • Range of values – this may not be so significant, since only positive values are used in the problem of calculating the sum of square roots.

In general, to calculate the sum of square roots with the decimal data type, the most significant parameters are the number of digits in the fractional part and the rounding mode. These parameters can be selected based on the required accuracy and speed of calculation.

Portion-based partitioning works by letting each worker thread periodically grab small "portions" of elements from the input sequence and process them [9]. For example (Fig. 1), the Parallel LINQ infrastructure starts by allocating very small portions (one or two elements at a time) and then increases the portion size as the query progresses: this ensures that small sequences parallelize effectively and that large sequences do not cause excessive synchronization cycles. If a worker thread receives "easy" elements (which are processed quickly), it will eventually obtain more portions. Such a scheme keeps each thread equally busy (and the processor cores balanced); the only drawback is that extracting elements from a shared input sequence requires synchronization, so some overhead and contention may appear.

Figure 1. Portion-based partitioning [9] 

The For method of the Parallel class works in a similar way, the only difference being that the iteration index, rather than an element of an input sequence, is what the loop body (more precisely, a delegate) receives.

The partitioning implementation is based on the same chunking mechanism, in which the portion size potentially grows as iteration processing proceeds successfully. This approach provides good load balancing when the number of iterations is small and minimizes the number of exclusive locks (taken while assigning ranges of iteration numbers to worker threads) when it is large. At the same time, most of a thread's iterations are kept concentrated in the same region of the iteration space, to achieve good cache locality.
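The chunking can also be made explicit: the Partitioner.Create(int, int) overload hands each worker contiguous index ranges, which Parallel.ForEach then aggregates through per-range local totals. A hedged sketch (the SumSqrt helper is our own illustration, not part of the article's code):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class RangePartitionDemo
{
    // Sums Math.Sqrt(i) for i in [from, to) using explicit range partitioning.
    public static double SumSqrt(int from, int to)
    {
        object locker = new object();
        double grandTotal = 0;

        // Each worker receives contiguous (fromInclusive, toExclusive) tuples.
        Parallel.ForEach(Partitioner.Create(from, to),
            () => 0.0,                              // Per-task local total.
            (range, state, localTotal) =>
            {
                for (int i = range.Item1; i < range.Item2; i++)
                    localTotal += Math.Sqrt(i);     // Sequential work inside the chunk.
                return localTotal;
            },
            localTotal => { lock (locker) grandTotal += localTotal; });

        return grandTotal;
    }

    static void Main()
    {
        Console.WriteLine(SumSqrt(1, 10_000_000));
    }
}
```

The same caveat applies here: because per-range totals are combined in a nondeterministic order, float and double results can still vary between runs.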

 

Investigation of the Parallel.For method to detail the cause of the ambiguous final result

The implementation of the For method is complex, and a full treatment is beyond the scope of this article. Nevertheless, let us note some points of the implementation of the Parallel.For method with a generic type argument (elided fragments are marked with …).

public static ParallelLoopResult For<TLocal>(int fromInclusive, int toExclusive,
    Func<TLocal> localInit, Func<int, ParallelLoopState, TLocal, TLocal> body, …)
{
    return ForWorker(fromInclusive, toExclusive, s_defaultParallelOptions,
        null, null, body, localInit, localFinally);
}

private static ParallelLoopResult ForWorker<TLocal>(int fromInclusive, int toExclusive,
    ParallelOptions parallelOptions, Action<int> body, …)
{
    rootTask = new ParallelForReplicatingTask(parallelOptions, delegate {
        if (rangeWorker.FindNewWork32(
                out var nFromInclusiveLocal,
                out var nToExclusiveLocal) &&
            !sharedPStateFlags.ShouldExitLoop(nFromInclusiveLocal))
        {
            do {
                if (body != null) {
                    for (int i = nFromInclusiveLocal; i < nToExclusiveLocal; i++) {
                        if (sharedPStateFlags.LoopStateFlags != ParallelLoopStateFlags.PLS_NONE
                            && sharedPStateFlags.ShouldExitLoop()) {
                            break;
                        }
                        body(i);
                    }
                }
            }
            while (rangeWorker.FindNewWork32(out nFromInclusiveLocal, out nToExclusiveLocal) …);
        }
    }, creationOptions, internalOptions);

    rootTask.RunSynchronously(parallelOptions.EffectiveTaskScheduler);
    rootTask.Wait();
}

internal ParallelForReplicatingTask(...) {
    m_replicationDownCount = parallelOptions.EffectiveMaxConcurrencyLevel;
}

The rootTask.RunSynchronously call starts the execution of tasks on the pool worker threads, while the number of tasks is determined by the ParallelOptions.EffectiveMaxConcurrencyLevel property. The FindNewWork32 method defines the working range for each pool thread. The code shows that a task's execution is not limited to its initially assigned range: in the while statement, the pool threads continue working on newly assigned ranges. We will detail the work of the Parallel.For method with a generic type argument on the earlier example of summing square roots, expanding the code as follows.

object locker = new object();
double grandTotal = 0;

ConcurrentBag<(int?, double)> cb1 = new ConcurrentBag<(int?, double)>();
ConcurrentDictionary<int?, long> cd = new ConcurrentDictionary<int?, long>();
ConcurrentBag<(int?, int)> cb2 = new ConcurrentBag<(int?, int)>();

var time = Stopwatch.StartNew();

Parallel.For(1, 1000,
    () => { return 0.0; },
    (i, state, localTotal) =>
    {
        cb1.Add((Task.CurrentId, localTotal));
        if (!cd.ContainsKey(Task.CurrentId)) cd[Task.CurrentId] = time.ElapsedTicks;
        cb2.Add((Task.CurrentId, i));
        return localTotal + Math.Sqrt(i);
    },
    localTotal =>
    { lock (locker) grandTotal += localTotal; }
);

cb1.GroupBy(_ => _.Item1).Select(_ => new
{
    TaskId = _.Key,
    Iterations = _.Count(),
    StartTime = cd[_.Key]
}).OrderBy(_ => _.StartTime).Dump();

var query = cb2.OrderBy(_ => _.Item2).GroupBy(_ => _.Item1, _ => _.Item2);

foreach (var grouping in query)
{
    Console.WriteLine("TaskId: " + grouping.Key);
    int? i = null;
    bool onlyOne = true;
    foreach (int iteration in grouping)
    {
        if (i == null)
            Console.Write("{" + $"{iteration}");
        else
        {
            if (iteration - i != 1)
                Console.Write(",...," + i + "}, {" + iteration);
            onlyOne = false;
        }
        i = iteration;
    }
    if (onlyOne) Console.WriteLine("}");
    else Console.WriteLine(",...," + i + "}");
}

This program lets us record:

  • TaskId – the identifier of each task;
  • Iterations – the number of iterations performed within each task;
  • StartTime – the start time of each task, expressed in ticks via the Stopwatch class (one tick is less than one microsecond);
  • the ranges of iteration numbers processed by each task. 

For example, running the program on a machine capable of executing 8 threads in parallel at the hardware level may produce the TaskId, Iterations and StartTime values shown in Table 2. The ranges of iteration numbers processed by each task are presented in Table 3.

Table 2. TaskId, Iterations and StartTime values after completion of the Parallel.For method

TaskId   Iterations   StartTime
20       205          54568
21       1            54597
16       1            54709
22       159          54846
18       204          54986
24       161          55689
17       111          55689
15       1            55821
19       156          55880

Table 3. Ranges of iteration numbers processed by each task

TaskId   Ranges of processed iteration numbers
24       {1,...,31}, {40,...,55}, {88,...,103}, {120,...,124}, {142,...,173}, {206,...,221}, {266,...,281}, {330,...,345}, {484,...,496}
22       {32,...,39}, {56,...,87}, {104,...,119}, {126,...,141}, {174,...,189}, {222,...,237}, {250,...,265}, {298,...,313}, {346,...,361}, {993,...,999}
15       {125}
20       {190,...,205}, {238,...,248}, {282,...,297}, {314,...,329}, {362,...,372}, {858,...,992}
16       {249}
17       {373,...,483}
18       {497,...,620}, {716,...,731}, {746,...,761}, {778,...,793}, {810,...,825}, {842,...,857}
19       {621,...,715}, {732,...,744}, {762,...,777}, {794,...,809}, {826,...,841}
21       {745}

The results show that the working ranges differ from task to task.

Some tasks consist of a single iteration. This is not a flaw in the algorithm behind the method under study, but a consequence of the fact that processing a single iteration here is computationally cheap. If, for example, the following lines are added to the body delegate (the fourth parameter of the Parallel.For method):

for (int k = 0; k < 1000000; k++)
    Math.Sqrt(k);

thereby making the processing of each loop iteration considerably heavier, a more uniform distribution of ranges across tasks can be obtained (Table 4). 

Table 4. TaskId, Iterations and StartTime values after making each iteration heavier

TaskId   Iterations   StartTime
13       79           50828
10       63           50849
12       79           51226
16       79           51698
15       79           52224
11       95           52788
19       108          53181
17       84           53640
14       79           53976
20       32           3263706
21       32           3355186
22       32           3462087
23       32           3543335
24       29           3562197
25       39           3575327
26       29           3639235
27       29           3797790

Thus, testing the Parallel.For method shows that repeated runs of a program using it create a different number of tasks and different working ranges. With float and double data, this behavior makes the result of the localFinally delegate, which performs the final action on each task's local result, ambiguous. To ensure high accuracy of the calculations, the code should be switched to the decimal type:

object locker = new object();
decimal grandTotal = 0;

Parallel.For(1, 10000000,
    () => (decimal)0,
    (i, state, localTotal) =>
        localTotal + (decimal)Math.Sqrt(i),
    localTotal =>
    { lock (locker) grandTotal += localTotal; }
);

grandTotal.Dump();

This transition carries a performance overhead (when summing the square roots of the numbers from 1 to 10^7 on a quad-core Intel Core i5-9300H processor, execution takes approximately 0.260 ms with the decimal type versus about 0.02 ms with double) and may be unjustified when increased accuracy is not required. In return, the output is unambiguous: 21081849486.44249240077608.

Conclusion 

The accuracy of calculations is of great importance in parallel programming, especially when solving complex scientific problems or tasks related to processing large amounts of data.

In parallel programming, when calculations are performed on multiprocessor systems, accuracy can suffer for several reasons:

  • Data synchronization problems – when multiple processors work on the same data, calculations may need to be synchronized between processors so that the results are consistent and accurate.
  • Communication problems – when processes exchange data, delays and data losses can occur, leading to errors in calculations.
  • Rounding problems – some operations lose accuracy, especially with floating-point data types, and in distributed computations on multiprocessor systems this loss can be exacerbated.

This article has paid special attention to the last reason, using the example of the For method of the Parallel class in the .NET Framework runtime. The Parallel.For method was analyzed, and the behavior of a parallel program computing the sum of square roots over a given range of numbers was investigated. The ambiguity of the calculation results was shown to have a compound cause, associated with rounding errors of real numbers and with the piecewise processing of the number range by pool threads. To reduce this ambiguity it is advisable to use more accurate data types; it is also of interest to use algorithms that reduce the impact of rounding errors and improve the consistency of results across threads [10].

References
1. X. Fan, R. -a. Wu, P. Chen, Z. Ning and J. Li, "Parallel Computing of Large Eigenvalue Problems for Engineering Structures," 2011 International Conference on Future Computer Sciences and Application, Hong Kong, China, 2011, pp. 43-46, doi: 10.1109/ICFCSA.2011.16.
2. Xuehui Chen, Liang Wei, Jizhe Sui, Xiaoliang Zhang and Liancun Zheng, "Solving fractional partial differential equations in fluid mechanics by generalized differential transform method," 2011 International Conference on Multimedia Technology, Hangzhou, 2011, pp. 2573-2576, doi: 10.1109/ICMT.2011.6002361.
3. R. Landau, "Computational Physics: A Better Model for Physics Education?," in Computing in Science & Engineering, vol. 8, no. 5, pp. 22-30, Sept.-Oct. 2006, doi: 10.1109/MCSE.2006.85.
4. H. Ali, A. Doucet and D. I. Amshah, "GSR: A New Genetic Algorithm for Improving Source and Channel Estimates," in IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 54, no. 5, pp. 1088-1098, May 2007, doi: 10.1109/TCSI.2007.893507.
5. D. -M. Zhu, J. -w. Gu, F. -H. Yu, W. -K. Ching and T. -K. Siu, "How correlation risk in basket credit derivatives might be priced and managed?," in IMA Journal of Management Mathematics, vol. 32, no. 2, pp. 195-219, April 2020, doi: 10.1093/imaman/dpaa013.
6. J. Xu and Y. Shi, "Financial Leasing, Optimal Financial Structure and Economic Growth: An Analysis Based on Financial Inclusion Perspective," 2019 International Conference on Economic Management and Model Engineering (ICEMME), Malacca, Malaysia, 2019, pp. 147-150, doi: 10.1109/ICEMME49371.2019.00038.
7. S. N. Cherny and R. F. Gibadullin, "The Recognition of Handwritten Digits Using Neural Network Technology," 2022 International Conference on Industrial Engineering, Applications and Manufacturing (ICIEAM), Sochi, Russian Federation, 2022, pp. 965-970, doi: 10.1109/ICIEAM54945.2022.9787104.
8. Albahari J. C# 10 in a Nutshell. O'Reilly Media, 2022.
9. Gibadullin R.F. Thread-safe calls of controls in rich client applications. Software Systems and Computational Methods, no. 4, pp. 1-19, 2022.
10. Rump, S.M. Fast and Parallel Interval Arithmetic. BIT Numerical Mathematics 39, 534–554 (1999). https://doi.org/10.1023/A:1022374804152.