System limits were reached on cluster

Hi,

For my master’s thesis I have to run fairly large GAMS simulations that are created by another tool. I assume they are correctly formed, given that this other tool works as expected.

I run them on the cluster I have access to, with the following command in the SLURM submission script:

srun $GAMSPATH/gams UCM_h.gms threads=16 workSpace=6300 > some-log_file.log

The job script specifies the resources to be allocated: I ask for 16 CPUs and 400 MB per CPU, hence 6400 MB in total. It’s worth mentioning that the UCM_h.gms file also specifies “Option threads=16;”, which actually takes precedence over the command line.

When I launch these, I get the following in the log (full log attached):

# ... lots of lines
Dual simplex solved model.

Root relaxation solution time = 12.91 sec. (9781.40 ticks)

        Nodes                                         Cuts/
   Node  Left     Objective  IInf  Best Integer    Best Bound    ItCnt     Gap

*     0+    0                       3.76641e+13   4.06277e+11            98.92%
Found incumbent of value 3.7664120e+13 after 55.04 sec. (50034.23 ticks)
      0     0   2.18079e+12 11426   3.76641e+13   2.18079e+12    70267   94.21%
*     0+    0                       3.76538e+13   2.18079e+12            94.21%
Found incumbent of value 3.7653757e+13 after 68.03 sec. (62585.06 ticks)
      0     0   2.18079e+12  8136   3.76538e+13    Cuts: 9899    83209   94.21%
      0     0   2.18080e+12  7895   3.76538e+13    Cuts: 7452    98828   94.21%
      0     0   2.18080e+12  6865   3.76538e+13    Cuts: 5977   110708   94.21%
*     0+    0                       3.76448e+13   2.18080e+12            94.21%
Found incumbent of value 3.7644847e+13 after 92.52 sec. (82006.49 ticks)
      0     0  -1.00000e+75     0   3.76448e+13   2.18080e+12   110708   94.21%
*     0+    0                       3.54979e+12   2.18080e+12            38.57%
Found incumbent of value 3.5497876e+12 after 100.50 sec. (86565.08 ticks)
      0     0   2.18080e+12  6612   3.54979e+12    Cuts: 3885   118320   38.57%
      0     0   2.18080e+12  6469   3.54979e+12    Cuts: 2720   123469   38.57%
Detecting symmetries...
      0     0   2.18080e+12  6306   3.54979e+12    Cuts: 1590   125968   38.57%
*     0+    0                       3.48532e+12   2.18080e+12            37.43%
Found incumbent of value 3.4853207e+12 after 130.46 sec. (107363.89 ticks)
      0     0   2.18080e+12  6400   3.48532e+12    Cuts: 1014   128020   37.43%
      0     0   2.18080e+12  6424   3.48532e+12     Cuts: 745   129478   37.43%
      0     0   2.18080e+12  6329   3.48532e+12     Cuts: 461   130489   37.43%
      0     0   2.18080e+12  6242   3.48532e+12     Cuts: 364   131104   37.43%
*     0+    0                       3.48333e+12   2.18080e+12            37.39%
Found incumbent of value 3.4833338e+12 after 139.51 sec. (113121.00 ticks)
      0     0   2.18080e+12  6519   3.48333e+12     Cuts: 192   131452   37.39%
Heuristic still looking.
Heuristic still looking.
--- Reading solution for model UCM_SIMPLE
--- Executing after solve: elapsed 0:24:50.581
--- UCM_h.gms(1161) 1866 Mb
--- GDX File (execute_unload) /home/ulg/thermlab/fstraet/work/data-generation/simulations/ic-2000/sim-0_1.61-0.99-0.20-0.24-0.48-0.35/debug.gdx
--- Generating MIP model UCM_SIMPLE*** Error: Could not spawn gamscmex, rc = 4
           Cmex executable : /home/users/f/s/fstraet/gams37.1_linux_x64_64_sfx/gamscmex.out
           System directory: /home/users/f/s/fstraet/gams37.1_linux_x64_64_sfx

From https://www.gams.com/latest/docs/UG_GAMSReturnCodes.html, I see this error code corresponds to “the system limits were reached”.
I can check the job memory usage, and I see “Memory Utilized: 6.06 GB Memory Efficiency: 96.98% of 6.25 GB”. However I don’t think it is a memory issue, mainly because there is a return code for that (10).
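
A quick sanity check of these memory figures, as a minimal Python sketch. It assumes the job report counts 1 GB as 1024 MB; the variable names are only illustrative.

```python
# Cross-check the SLURM request against the reported memory usage.
CPUS = 16
MEM_PER_CPU_MB = 400
WORKSPACE_MB = 6300        # value passed as workSpace= on the GAMS command line
REPORTED_EFFICIENCY = 0.9698

total_mb = CPUS * MEM_PER_CPU_MB              # total memory requested from SLURM
total_gb = total_mb / 1024                    # assumption: 1 GB = 1024 MB in the report
used_gb = REPORTED_EFFICIENCY * total_gb      # memory actually touched by the job
margin_mb = total_mb - WORKSPACE_MB           # what is left above workSpace

print(f"allocated: {total_mb} MB = {total_gb:.2f} GB")
print(f"utilized:  {used_gb:.2f} GB")
print(f"margin above workSpace: {margin_mb} MB")
```

The numbers are consistent with the report (6.25 GB allocated, 6.06 GB used), and they show that the job ran within a few percent of its allocation, with only 100 MB between the workSpace value and the total request.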

While checking the job afterwards, I also observe “CPU Efficiency: 10.76% of 07:04:32 core-walltime”. I have pre- and post-processing scripts in Python, so these waste a bit of performance, but they take around 2 minutes to run, so it should still be possible to reach around 95%.
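
To put that efficiency figure in perspective, here is a small Python sketch of the arithmetic. It assumes that “core-walltime” means the number of allocated cores times the elapsed wall time; the helper names are made up for this example.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert 'HH:MM:SS' to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def seconds_to_hms(seconds: float) -> str:
    """Convert seconds to 'HH:MM:SS', rounded to the nearest second."""
    total = int(round(seconds))
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


CORES = 16
EFFICIENCY = 0.1076
core_walltime_s = hms_to_seconds("07:04:32")  # assumed: cores x elapsed time

elapsed_s = core_walltime_s / CORES           # wall-clock duration of the job
cpu_busy_s = EFFICIENCY * core_walltime_s     # CPU time actually consumed
avg_busy_cores = cpu_busy_s / elapsed_s       # average number of busy cores

print("elapsed wall time:", seconds_to_hms(elapsed_s))
print("CPU time used:    ", seconds_to_hms(cpu_busy_s))
print(f"average busy cores: {avg_busy_cores:.2f} of {CORES}")
```

Under that assumption the job ran for roughly 26.5 minutes and kept fewer than two of the 16 cores busy on average.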

Therefore, my questions are:

  • What does “System limits” actually refer to in this context?
  • Did I do something wrong while calling GAMS so that it fails?
  • Is it normal that I get such low CPU efficiency on the cluster?

If I forgot some important piece of information, please tell me.
Any help is really appreciated.

François
gamsrun_0-0.log (75.3 KB)

What does “System limits” actually refer to in this context?

GAMS has some internal limits, e.g. the number of variables a model can have (fewer than 2.1e9), among others. Hitting one of them is what this return code means. I am a little surprised that the log did not give you more information on which limit was hit. Did the LST file contain more information? The entire log is fishy: the Cplex log looks odd, and I am missing the tail of a GAMS/Cplex solve. Not sure where it went. So I would not trust the log (from the cluster) very much, and probably not the return code either.

Did I do something wrong while calling GAMS so that it fails?

Probably not. GAMS “dies” in the step where it generates the model UCM_SIMPLE again. Can you estimate how big the model is going to be? Is it comparable in size to the first one? If so, you can easily run this on any off-the-shelf computer. See if you get more or better information when you do this. I would also try the latest GAMS release.

Is it normal that I get such low CPU efficiency on the cluster?

Yes. There is little Cplex can do in parallel here. The root relaxation (for which Cplex uses concurrent optimization) is done in under 13 seconds. Then it starts B&C and terminates while still at node 0, so there is nothing to parallelize. Parallel B&C makes sense when you go through many nodes (and the B&C tree Cplex builds up is wide, so it can process nodes in parallel). Just to be clear: cluster or not, Cplex utilizes the cores of a single CPU, so you can just run this on your local laptop and probably get as good a performance as on this cluster.

-Michael

First, thank you for the clarifying answer.

Because of storage constraints, the scripts remove the files once the results are read, so I do not have the LST file as of now (I have started a new run and will get to it later).
However, I have an example of the model that is simulated, around 1600 lines long (see attachment). This example is not the one that caused the problem, but the differences are for the most part a matter of input data (I’ll send that one along with the LST file).

As for parallel execution, I’ll just go with very few threads then. For some reason the model is created with “Option threads=16”; I did not suspect that it couldn’t really be taken advantage of.

François
UCM_h.gms (63.9 KB)

In order to run this, we would need a GDX file. Can you supply one?

Moreover, the performance of the export to Excel can be improved. Rather than calling gdxxrw once for each variable/parameter, you can create a gdxxrw instruction file (at compile time with $on/offEcho or at execution time with $on/offPut) and call gdxxrw only once. Currently the model does this:

$if not %PrintResults%==1 $exit

EXECUTE 'GDXXRW.EXE "%inputfilename%" O="Results.xlsx" Squeeze=N par=Technology rng=Technology!A1 rdim=2 cdim=0'
EXECUTE 'GDXXRW.EXE "%inputfilename%" O="Results.xlsx" Squeeze=N par=PowerCapacity rng=PowerCapacity!A1 rdim=1 cdim=0'
EXECUTE 'GDXXRW.EXE "%inputfilename%" O="Results.xlsx" Squeeze=N par=PowerInitial rng=PowerInitial!A1 rdim=1 cdim=0'
EXECUTE 'GDXXRW.EXE "%inputfilename%" O="Results.xlsx" Squeeze=N par=RampDownMaximum rng=RampDownMaximum!A1 rdim=1 cdim=0'
...

Instead, do this:

$if not %PrintResults%==1 $exit
file fx / gdxxrw.in /; put fx;
$onPut
i="%inputfilename%"
o=Results.xlsx
Squeeze=N par=Technology rng=Technology!A1 rdim=2 cdim=0
Squeeze=N par=PowerCapacity rng=PowerCapacity!A1 rdim=1 cdim=0
Squeeze=N par=PowerInitial rng=PowerInitial!A1 rdim=1 cdim=0
Squeeze=N par=RampDownMaximum rng=RampDownMaximum!A1 rdim=1 cdim=0
...
$offPut
putclose fx;
EXECUTE 'GDXXRW.EXE @gdxxrw.in';
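
Since the workflow already has Python pre- and post-processing scripts, the same instruction text could also be produced there. The following is only a sketch under assumptions: the function name and the GDX file name `results.gdx` are hypothetical placeholders, and it assumes every parameter goes to a sheet of the same name starting at cell A1.

```python
# (symbol name, rdim, cdim), taken from the four lines shown above
SYMBOLS = [
    ("Technology", 2, 0),
    ("PowerCapacity", 1, 0),
    ("PowerInitial", 1, 0),
    ("RampDownMaximum", 1, 0),
]


def gdxxrw_instructions(gdx_file: str, xlsx_file: str, symbols) -> str:
    """Build the text of a gdxxrw instruction file: one line per parameter."""
    lines = [f'i="{gdx_file}"', f"o={xlsx_file}"]
    for name, rdim, cdim in symbols:
        lines.append(
            f"Squeeze=N par={name} rng={name}!A1 rdim={rdim} cdim={cdim}"
        )
    return "\n".join(lines) + "\n"


# "results.gdx" is a placeholder for whatever %inputfilename% expands to.
text = gdxxrw_instructions("results.gdx", "Results.xlsx", SYMBOLS)
print(text)
```

Saving the printed text as gdxxrw.in gives a file that can be passed to gdxxrw with @gdxxrw.in, exactly as in the GAMS version.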

With more recent versions of GAMS you can even get the Excel file created on your Linux cluster using GAMS/Connect, see PandasExcelWriter (https://www.gams.com/latest/docs/UG_GAMSCONNECT.html#UG_GAMSCONNECT_PANDASEXCELWRITER) for details.

-Michael

Hello,

Here is the set of files used to run a simulation, and the LST and log I got from running it.

From the LST file it looks like CPLEX’s execution is aborted by the user (“SOLVER STATUS 13 System Failure MODEL STATUS 13 Error No Solution”, apparently no resource used so no computations at all).
But what surprised me the most is that the LST file ends right in the middle of a number, i.e. an IO write. So I’d guess some thread was writing while another one triggered some error (that did not make it into the LST file).

As I previously mentioned, these GAMS models are generated by another tool. The thing is that I’m running these simulations on unusual input ranges, compared to the typical runs that have shown that the tool works. So I’m now trying a more “classical” input setting, expecting that one to go well.

If it does not, my best guess so far is that the interrupt comes from the cluster.

Again, thank you a lot for your help,
François
UCM_h.lst.txt (264 KB)
Inputs.gdx (88.6 MB)
UCM_h.gms (65.5 KB)
gamsrun_0-2.log (435 KB)
cplex.opt.txt (149 Bytes)

UCM_h.log (67.8 KB)
This all looks very much like the cluster is killing your jobs. GAMS runs the solver as a separate executable (communicating the model instance via scratch files), so even if the solver dies, the GAMS job should continue and write a proper LST file. Since this does not happen (and since the model works fine on my laptop, see the comment about runtime below), I can only conclude that the problem is with the cluster. First it kills your GAMS/CPLEX jobs (Model/Solve status 13/13) and eventually the GAMS job itself, probably because the job exceeds some limit (memory, disk, CPU, time, IO reads, or something else).

Your MIPs are difficult: CPLEX will build up a very large B&C tree and won’t be able to solve to your required optimality tolerance. So even if your cluster did not kill the jobs, I doubt you would get much joy out of your computations, because the very first GAMS/CPLEX job won’t finish in any reasonable time (I killed it after 2h and attached the log so far). You did not specify a time limit, so this might run forever.
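
How far the solve is from any tight tolerance can be read off the Gap column of the node log in the first post. As a rough illustration, the sketch below recomputes that column with the usual CPLEX definition of the relative MIP gap, |best bound - best integer| / (1e-10 + |best integer|); the function name is made up for this example.

```python
def relative_gap(best_integer: float, best_bound: float) -> float:
    """Relative MIP gap as CPLEX defines it."""
    return abs(best_bound - best_integer) / (1e-10 + abs(best_integer))


# (best integer, best bound) pairs copied from the node log in the first post
log_rows = [
    (3.76448e13, 2.18080e12),
    (3.54979e12, 2.18080e12),
    (3.48333e12, 2.18080e12),
]

for incumbent, bound in log_rows:
    gap_percent = 100 * relative_gap(incumbent, bound)
    print(f"{incumbent:.5e}  {bound:.5e}  gap = {gap_percent:.2f}%")
```

The last row reproduces the 37.39% shown in the log: the incumbents improve, but the best bound has not moved from 2.18080e+12, so the gap is still very large at node 0.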

-Michael

So, after some more research (and detours in the meantime), I made the following changes and I no longer have any problem:

  • I increased the amount of memory allocated to my simulations on the cluster (from 6.4 to 25.6 GB)
  • I reduced the length of the rolling horizon of the simulations being run. (The tool that creates the simulations covers a whole year; since solving for the whole year at once is intractable, the optimization horizon is set to some shorter value, which I decreased.)

I think the main cause was the memory. I looked at my job’s performance (99% CPU efficiency, since I now use only 1 thread) and I got about 30% memory efficiency out of the 25.6 GB allocated, that is about 7.6 GB, which is more than the original 6.4 GB (so I will look for a better memory allocation :smiley: )
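
One possible way to pick a tighter allocation from these figures, as a Python sketch. The 25% headroom factor is purely an assumption chosen for illustration, not a recommendation from this thread, and the 30% efficiency is only approximate.

```python
ALLOCATED_GB = 25.6        # current request
MEMORY_EFFICIENCY = 0.30   # approximate figure from the job report
OLD_ALLOCATION_GB = 6.4    # request used by the failing jobs
HEADROOM = 1.25            # hypothetical safety factor (assumption)

peak_gb = ALLOCATED_GB * MEMORY_EFFICIENCY   # estimated peak memory use
suggested_gb = peak_gb * HEADROOM            # candidate for the next request

print(f"estimated peak use: {peak_gb:.2f} GB")
print(f"exceeds old allocation: {peak_gb > OLD_ALLOCATION_GB}")
print(f"candidate request with headroom: {suggested_gb:.1f} GB")
```

With these inputs the estimated peak is a little above 7.6 GB, which confirms that the earlier 6.4 GB request was too small, while a request of roughly 10 GB would already leave some margin.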

Anyway, thanks a lot for the guidance!
François

May I ask for your advice on how to increase the amount of memory allocated to my simulations?
My computer’s memory is large enough, but I don’t know how to raise GAMS’s internal limits.