Help wanted speeding up code

AndrewC · February 15, 2022, 3:53am

We have some code that has become a bottleneck as the data size has increased. I know that how things are written can affect the speed of code execution in GAMS so wondered if someone could help us.

We have annual data in the database that contains the assumption that if data is missing then the last value with data should be copied down all subsequent years. So for example if we entered data for years 2025 and 2026, but our years went up to 2050, then we would take the last known data value (the value for 2026) and copy it down into 2027…2050. We’re calling this “auto populate” the data.

We have created some helper sets and do the auto populate with a variety of loops. But as the dataset has grown, mainly the number of elements in our dataset rather than the years, it has slowed to a real bottleneck. The slowness is not in creating the helper sets, it is in the loops which fill the data. Our data is something like:

set g = items we have annual data for.
set ay = all years
sets ay2,ay3 = alias for ay.
parameter MY_DATA(g,ay) = our data
reverse_ay(ay,ay2) = helper set that lets us access years in reverse order.
year1_lt_year2(ay,ay1) = help set that maps all years ay1 that are higher than the year ay
scalar FINISHED = helper scaler for effectively exiting loops
scalar HORIZON_LAST_YEAR_ORD = the last year we want to auto populate data for in this particular run (sometimes we want to run a very small subset of years rather than going out to the full end of the horizon)

Some example data to show what we get and what we want. Just small amounts of data shown for demonstration. Assume we have 2025…2030

reverse_ay(ay,ay+[card(ay)-2*ord(ay)+1]) = yes;

reverse_ay():
2025,2030
2026,2029
2027,2028
2028,2027
2029,2026
2030,2025

Some examples of year1_lt_year2():
2025,2026
2025,2027
2025,2028
2025,2029
2025,2030
2026,2027
2026,2028
2026,2029
2026,2030
etc

If we were given the data:
MY_DATA(x,2025) = 10
MY_DATA(x,2026) = 20
MY_DATA(y,2025) = 30

we would want to auto populate all years > 2026 for x with 20, and all years > 2025 for y with 30, so we would end up with:
MY_DATA(x,2025) = 10
MY_DATA(x,2026) = 20
MY_DATA(x,2027) = 20
MY_DATA(x,2028) = 20
MY_DATA(x,2029) = 20
MY_DATA(x,2030) = 20
MY_DATA(y,2025) = 30
MY_DATA(y,2026) = 30
MY_DATA(y,2027) = 30
MY_DATA(y,2028) = 30
MY_DATA(y,2029) = 30
MY_DATA(y,2030) = 30

Below are the loops we use. I am hand typing these in so there might be a typo, but its the general principal of how to speed these up that we need help with.

loop (g)$(sum(ay, MY_DATA(g,ay)) > 0) do { only enter this loop if we actually have any data at all for this g element }
    FINISHED = 0;
    { Loop backwards from the last year we are interested in until we find the first bit of non zero data }
    loop (reverse_ay(ay,ay2))$((NOT FINISHED) AND (ord(ay2) <= HORIZON_LAST_YEAR_ORD)) do
      if (MY_DATA(g,ay2) > 0) then
        { Found very last non zero entry, it is in year ay2 }
        FINISHED = 1;
        { If this isn't the very last year, copy the ay2 value down onto each subsequent year ay3 in the model horizon, but don't bother auto populating further than the last year we are interested in }
        if (ord(ay2) < HORIZON_LAST_YEAR_ORD) then
          MY_DATA(g,ay3)$(ay1_lt_ay2(ay2,ay3) AND (ord(ay3) <= HORIZON_LAST_YEAR_ORD)) = MY_DATA(g,ay2);
        endif;
      endif;
    endloop;
  endloop;

Can anyone see a faster way to do this? Our data is getting to the stage where we have approximately 100,000 elements in set g, and that’s growing as we speak. The size of set g dwarfs set ay, which would typically be 50 years. And we have more than just one parameter where we want to do this, and some of them have 3 or 4 indexes rather than just 2 (e.g. MY_DATA2(g,h,i,ay))

Perhaps it might be something like testing for HORIZON_LAST_YEAR_ORD all the time is way slower than just populating data in the horizon we wont end up using. But I don’t know, and I imagine there’s better ways to rewrite this entirely.

All help appreciated.

Thanks
Andy C

mbussieck · February 16, 2022, 10:32am

Hi,

Having a script to execute is key when you do performance analysis. I transformed your description and loop code into a model where I can easily change the size of the g and ay sets. I also have some code to generate random data for arbitrary sized instances (this code itself does not execute that fast). Anyhow, the essential code to do what you want to do is here:

Parameter posLastYear(g), valLastYear(g);
posLastYear(g) = smax(ay$MY_DATA(g,ay), ord(ay))-1;
valLastYear(g) = sum(ay$(ord(ay)=1), MY_DATA(g,ay+posLastYear(g)));
MY_DATA(g,ay)$(not MY_DATA(g,ay)) = valLastYear(g);

Determine the year position of the last nonzero element for each g. Get the value of this element and then assign this to all the zero values in MY_DATA.

For g=5000 and 75 years this takes no time <0.1 secs. Your loop code ran for 23 secs. For completeness, I have attached the entire code (the new and the loop code):
new15.gms (1.88 KB)
-Michael

AndrewC · February 17, 2022, 12:13am

Amazing, thank you so much.

Just a small note, I changed the code that assigned the valLastYear to the unpopulated values to be:

MY_DATA(g,ay)$(ord(ay) > posLastYear(g)) = valLastYear(g);

The reason is that we might have some 0 data in MY_DATA(g,ay) before the year that contains the last year value. E.g. 10, 10, 0, 15, 20, 0, 0, 0, 0… The 3rd year’s 0 value is meant to be 0, its only the final 0’s that need to be autopopulated to 20.

Thanks again
Andy C