Fortran Discourse: Function Overhead

Date: 2022-05-08
categories: fortran;

Investigation

There was an interesting question on the Fortran Discourse channel in the past couple of days and I decided to do a little digging. The gist of the issue is this:

  • Skipping the call to get_pointer altogether: 10-11 seconds
  • Using the version without the class(moa_view_type) argument: 19-21 seconds
  • Using the version with the “view” argument: 44-46 seconds

struggling-with-a-strange-performance-issue/3372

This is the sort of problem that I enjoy investigating, and since it's a rainy Sunday morning I took an hour for a fun exploration.

My first thought was that maybe there was a pass-by-value situation going on and the view param was getting copied, which in a tight loop could definitely hamper performance. This seemed like a good time to crack out my assembly language skillz and do some investigation. It's important to know that I'm not a Fortran programmer but I do have a lot of appreciation for the revival Fortran is having lately. Since I'm not an expert, it's certainly possible for me to have missed some important details. On the other hand, the generated assembly code cuts past several layers of high-level abstraction.

I copied the OP's functions and the signatures are below:

subroutine get_pointer_dummy(idx, elem, found)
subroutine get_pointer(view, idx, elem, found)

The question revolves around the first parameter (view). Why does adding that parameter cause performance to be so much slower?
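
To make the question concrete, here is a minimal sketch of what the two variants might look like. Only the subroutine names and argument lists come from the signatures above. The type view_t (standing in for the OP's moa_view_type), its data component, the module variable shared_data, and both subroutine bodies are placeholders of my own, not the OP's code. My understanding is that a dummy argument without the value attribute is normally handed over by reference by compilers such as gfortran, so the extra view argument should not by itself force a copy of the whole object.

```fortran
module view_sketch
    implicit none
    private
    public :: view_t, shared_data, get_pointer, get_pointer_dummy

    ! Placeholder type; stands in for the OP's moa_view_type.
    type :: view_t
        integer, allocatable :: data(:)
    end type view_t

    ! Hypothetical module-level storage for the variant without "view".
    integer, allocatable, target :: shared_data(:)

contains

    ! Variant with the polymorphic "view" argument.
    subroutine get_pointer(view, idx, elem, found)
        class(view_t), target, intent(in) :: view
        integer, intent(in)               :: idx
        integer, pointer, intent(out)     :: elem
        logical, intent(out)              :: found

        elem => null()
        found = .false.
        if (.not. allocated(view%data)) return
        if (idx < 1 .or. idx > size(view%data)) return
        elem => view%data(idx)
        found = .true.
    end subroutine get_pointer

    ! Same lookup, but without the "view" argument.
    subroutine get_pointer_dummy(idx, elem, found)
        integer, intent(in)           :: idx
        integer, pointer, intent(out) :: elem
        logical, intent(out)          :: found

        elem => null()
        found = .false.
        if (.not. allocated(shared_data)) return
        if (idx < 1 .or. idx > size(shared_data)) return
        elem => shared_data(idx)
        found = .true.
    end subroutine get_pointer_dummy

end module view_sketch

program demo
    use view_sketch
    implicit none

    type(view_t), target :: v
    integer, pointer     :: p
    logical              :: ok

    v%data      = [10, 20, 30]
    shared_data = [10, 20, 30]

    call get_pointer(v, 2, p, ok)
    if (ok) print *, 'get_pointer:       ', p

    call get_pointer_dummy(2, p, ok)
    if (ok) print *, 'get_pointer_dummy: ', p

    ! An out-of-range index leaves the pointer null and found = .false.
    call get_pointer(v, 7, p, ok)
    print *, 'index 7 found?     ', ok
end program demo
```

The two bodies are deliberately identical apart from where the array lives, which mirrors the situation in the question: the only difference the compiler has to handle is the extra argument.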

I always find it useful to create an MRE (minimal reproducible example) of a problem. I use this technique a lot in my day job, and it's one of the major reasons why I find unit tests useful. I set out to create an example for this case, even though I don't really know much about Fortran's syntax.

I set up a simple example with fpm, the Fortran Package Manager. fpm is itself written in Fortran, is currently in alpha, and is inspired by Rust's cargo and similar modern build tools.
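
For a reproduction like this, the timing loop can be sketched roughly as follows. This is not the code from my repository: the iteration count n_iter and the subroutine placeholder_work are made up for illustration, and the real test would call get_pointer or get_pointer_dummy inside the loop instead. It uses the standard system_clock and cpu_time intrinsics to report wall-clock and CPU time.

```fortran
program timing_sketch
    use, intrinsic :: iso_fortran_env, only: int64, real64
    implicit none

    ! Hypothetical iteration count; pick whatever gives a measurable run time.
    integer, parameter :: n_iter = 10000000

    integer(int64) :: count_start, count_end, count_rate
    real(real64)   :: cpu_start, cpu_end, wall, total
    integer        :: i

    print '(a)', ' :: testing placeholder_work'

    call system_clock(count_start, count_rate)
    call cpu_time(cpu_start)

    total = 0.0_real64
    do i = 1, n_iter
        ! Stand-in for the call under test.
        call placeholder_work(i, total)
    end do

    call cpu_time(cpu_end)
    call system_clock(count_end)

    wall = real(count_end - count_start, real64) / real(count_rate, real64)

    print '(a, f10.6)', 'Wall clock (s): ', wall
    print '(a, f10.6)', 'CPU time (s):   ', cpu_end - cpu_start

    ! Print the accumulated value so the optimizer cannot drop the loop.
    print '(a, es14.6)', 'checksum:       ', total

contains

    subroutine placeholder_work(k, acc)
        integer, intent(in)         :: k
        real(real64), intent(inout) :: acc
        acc = acc + sqrt(real(k, real64))
    end subroutine placeholder_work

end program timing_sketch
```

Printing a value that depends on the loop matters mostly for the -O3 build, where work whose result is never used may be removed entirely and make the timing meaningless.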

Results

I performed the tests on Windows 10 Pro with gfortran 11.2.0 (mingw compiled from source from the nuwen distribution).

I wasn't able to reproduce the OP's findings of vastly slower performance, but I did find some interesting differences:

# Unoptimized
 :: testing get_pointer
Wall clock (s): 0.656000
CPU time (s):   0.656250

 :: testing get_pointer_dummy
Wall clock (s): 0.578000
CPU time (s):   0.578125

# Optimized with -O3
 :: testing get_pointer
Wall clock (s): 0.281000
CPU time (s):   0.281250

 :: testing get_pointer_dummy
Wall clock (s): 0.281000
CPU time (s):   0.281250

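This is not the huge slowdown that Arjen (the OP) reported. In the unoptimized build there was still a measurable difference between the regular and "dummy" versions (about 0.66 s versus 0.58 s), even though the code inside the two subroutines was identical; with -O3 the timings were the same. You can see the functions here: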

https://github.com/matthew-macgregor/fort-lang-disc-3372/blob/721977ae87e371bd258254682e41e1383ebd300d/src/module.f90#L53

Digging into the generated assembly, the optimized build differed by only two instructions between the two versions, counting both the loop and the subroutine (basically, the handling needed to pass the extra argument).

The unoptimized build differed more substantially: 7 more instructions were generated, across the loop and the subroutine, for get_pointer_dummy. I believe that explains the difference in performance.

unoptimized_get_pointer.s
unoptimized_get_pointer_dummy.s

Instructions are in the README, but for ease of reading these were the commands I used:

# Unoptimized
gfortran -masm=intel -S src\module.f90 -o asm\module.s
gfortran -masm=intel -S app\main.f90 -o asm\main.s -J build\gfortran_2A42023B310FA28D

# Optimized
gfortran -masm=intel -O3 -S src\module.f90 -o asm\module_o3.s
gfortran -masm=intel -O3 -S app\main.f90 -o asm\main_o3.s -J build\gfortran_2A42023B310FA28D

Conclusion

I wasn't able to reproduce the OP's experience, and what I did observe seems consistent with the generation of unoptimized assembly code. Nothing here looks unexpected, but it also doesn't explain the drastic difference that Arjen noticed.