Handling NaN (missing) values

achristensen · November 14, 2024, 3:51pm

Hi grecht, Great questions… thanks.

First some quick background. You already know much of this, but I am adding it for completeness. There are 3 GAMS special values: EPS, NA and UNDF. EPS is used to explicitly represent a zero in GAMS and is mathematically zero (GAMS, or more precisely CMEX, our execution system, does not store zeros, so we needed a way around this). NA can be used in GAMS to initialize a symbol, but it is not a numerical value at all; it is more of a placeholder. If a model contains an NA value the user will get an execution error. Many people use NA to initialize a symbol and then put their data into that symbol. If their data doesn’t cover their model’s use case, the error tells them there is either missing data or a mistake in how the model is constructed. UNDF is a special value that is returned when a function evaluation goes sideways (like 1/0).

But now we are working in the world of Python, so we need two things: 1) to be able to represent all these special values from GAMS, and 2) to maintain a float column datatype so that pandas stays performant. Thus, we represent:

EPS as -0.0 (a negative zero, which is still mathematically zero)
UNDF as a nan
NA also as a nan
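
As a small illustration of the EPS mapping, here is a minimal sketch using only the standard library. The names EPS_STANDIN and is_negative_zero are made up for this example and are not part of any GAMS API; the point is only that -0.0 compares equal to zero while its sign bit still lets you tell it apart from a plain 0.0.

```python
import math

# Assumption: a stand-in for how EPS shows up in a float column.
EPS_STANDIN = -0.0

def is_negative_zero(x: float) -> bool:
    """True only for -0.0: equal to zero, but with the sign bit set."""
    return x == 0.0 and math.copysign(1.0, x) < 0.0

print(EPS_STANDIN == 0.0)             # True, still mathematically zero
print(is_negative_zero(EPS_STANDIN))  # True
print(is_negative_zero(0.0))          # False
```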

There are many, many nans available to use for UNDF and NA, so we specifically use:

UNDF is float("nan") which has a byte pattern of:

In [1]: import struct

In [2]: struct.pack('>d', float("nan")).hex()
Out[2]: '7ff8000000000000'

np.nan also has the same byte pattern:

In [3]: import numpy as np

In [4]: struct.pack('>d', np.nan).hex()
Out[4]: '7ff8000000000000'

np.nan is therefore only ever interpreted as UNDF, which is why you are not able to countNA in your p1.

NA is represented as a nan with a byte pattern of fffffffffffffffe:

In [5]: struct.unpack(">d", bytes.fromhex("fffffffffffffffe"))[0]
Out[5]: nan
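
To make the difference between the two nans concrete, here is a minimal sketch using only the standard library. The helper names (to_bits, from_bits, is_na) are hypothetical and only for illustration; this is not how GAMSPy or GAMS Transfer is implemented internally. It shows that both values are nan as far as Python is concerned, and that only the byte pattern tells them apart.

```python
import math
import struct

NA_BITS = "fffffffffffffffe"  # byte pattern quoted above for NA

def to_bits(x: float) -> str:
    """Big-endian hex of the 8 bytes that make up a Python float."""
    return struct.pack(">d", x).hex()

def from_bits(hex_pattern: str) -> float:
    """Build a float from a 16-character hex byte pattern."""
    return struct.unpack(">d", bytes.fromhex(hex_pattern))[0]

def is_na(x: float) -> bool:
    """Hypothetical check: does x carry the NA byte pattern?"""
    return to_bits(x) == NA_BITS

na = from_bits(NA_BITS)
plain_nan = float("nan")

print(math.isnan(na), math.isnan(plain_nan))  # both are nan
print(na == plain_nan)                        # False, nan never compares equal
print(is_na(na), is_na(plain_nan))            # only the first carries the NA pattern
```

Because math.isnan and == cannot separate the two, any NA-aware logic has to look at the bytes.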

Choosing float("nan") (or np.nan) for UNDF rather than for NA is consistent with what other functions return when an evaluation fails, for example:

In [11]: np.sqrt(-1)
<ipython-input-11-597592b72a04>:1: RuntimeWarning: invalid value encountered in sqrt
  np.sqrt(-1)
Out[11]: np.float64(nan)

np.float64("nan") also has the same byte pattern as np.nan and float("nan").

A “special” nan is used for NA, which in GAMS means “initialized, but no numerical value assigned”, aka “missing”.

Hopefully that helps untangle the nan behavior you are seeing.

Now on to the drop* methods.

dropUndef really means drop all nans that are GAMS UNDF special values.
dropNA really means drop all nans that are GAMS NA special values.
and
dropMissing really means drop all nans.
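
The three behaviors can be sketched on a plain Python list. This is only an illustration with hypothetical function names (drop_na, drop_undef, drop_missing), not the actual GAMSPy implementation, and it assumes that every nan without the NA byte pattern counts as UNDF.

```python
import math
import struct

NA_BITS = "fffffffffffffffe"  # NA byte pattern from the post
NA = struct.unpack(">d", bytes.fromhex(NA_BITS))[0]
UNDF = float("nan")

def is_na(x: float) -> bool:
    """nan carrying the NA byte pattern."""
    return struct.pack(">d", x).hex() == NA_BITS

def is_undf(x: float) -> bool:
    """Assumption: any nan that is not NA is treated as UNDF."""
    return math.isnan(x) and not is_na(x)

def drop_na(values):
    return [v for v in values if not is_na(v)]

def drop_undef(values):
    return [v for v in values if not is_undf(v)]

def drop_missing(values):
    return [v for v in values if not math.isnan(v)]

values = [1.0, NA, 2.0, UNDF, -0.0]
print(drop_na(values))       # NA removed, UNDF kept
print(drop_undef(values))    # UNDF removed, NA kept
print(drop_missing(values))  # both nans removed, -0.0 (EPS) kept
```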

The “missing” naming follows the pandas dropna behavior, but there is an obvious naming clash with the pandas method, so we adopted the (hopefully clearer) dropMissing name for a native GAMSPy method that simply removes all nans without relying on native pandas functionality (although mixing and matching pandas and GAMSPy methods is very common and powerful).

You also state:

My use case is the following: I have a parameter based on which I want to define variable limits. This parameter is not defined for all set elements, and naturally I do not want to set any limit in the undefined cases. However, since GAMS assumes Parameters to be zero where they are not defined, it would simply set the limit to zero.

My suggestion is to simply define the sparse data for the parameter rather than using a numpy array. At this time, numpy arrays are assumed to be dense data structures which means that you must define all values for all domain tuples. This might get relaxed in a future release.

Something like this:

p1 = ct.addParameter("p1", domain=S, records=[("b", 1)])

Then you could set your variable bounds:

v.lo[S].where[p1[S]] = p1[S]

hope this is helpful,
adam