Handling NaN (missing) values

Description of the issue

I have found an issue and have a general question about handling NaN values.

np.nan values are not interpreted as gp.SpecialValues.NA

Maybe I'm misunderstanding what gp.SpecialValues.NA is supposed to represent, but np.nan values are not interpreted as it. Is this intended, meaning I would have to map np.nan values to gp.SpecialValues.NA myself?

import gamspy as gp
import numpy as np

ct = gp.Container()
S = ct.addSet("S", records=["a", "b"])

p1 = ct.addParameter("p1", domain=S, records=np.array([[np.nan, 1]]))
p2 = ct.addParameter("p2", domain=S, records=np.array([[gp.SpecialValues.NA, 1]]))

print(p1.countNA())
print(p2.countNA())

Output:

0
1

Edit: I just realized that the np.nan value in p1 is considered both “missing” and “undefined”, since p1.dropMissing() and p1.dropUndef() both remove this np.nan value. Now I am a bit confused. I would expect a state of “NA” (not available?) to be implied by both the “missing” and “undefined” states, so in this case p1.dropNA() should lead to the same result as p1.dropMissing() and p1.dropUndef().
Also, while I understand that there can be a difference between “undefined” and “missing”, I do not understand how there can be a difference between “NA” and “missing”.

General question on handling NaN values

My use case is the following: I have a parameter based on which I want to define variable limits. This parameter is not defined for all set elements, and naturally I do not want to set any limit in the undefined cases. However, since GAMS assumes Parameters to be zero where they are not defined, it would simply set the limit to zero.

How do I circumvent this? My first idea was to explicitly set the Parameter to NaN in the undefined cases and then add a condition in the assignment requiring the Parameter not to be equal to NaN. But this feels a little backwards and may hurt performance. Is there a better way?

GAMSPy version

1.1.0

Hi grecht,

Great questions… thanks.

First, some quick background; you already know much of this, but I am adding it for completeness. There are three GAMS special values: EPS, NA, and UNDF. EPS is used to explicitly represent a zero in GAMS and is mathematically zero; GAMS (more specifically CMEX, our execution system) will not store zeros, so we needed a way to get around this. NA can be used in GAMS to initialize a symbol, but it is not assigned any numerical value at all; it is more of a placeholder. If a model contains an NA value, the user will get an execution error. Many people use NA to initialize a symbol and then load their data into it; if their data does not cover the model’s use case, they then know there is either missing data or an error in how the model is constructed. UNDF is a special value that is returned when a function evaluation goes sideways (like 1/0).

But now we are working in the world of Python, so we need two things: 1) to be able to represent all of these special values from GAMS, and 2) to maintain a float column datatype so that pandas stays performant. Thus, we represent:

EPS as -0.0 (a negative zero, which is still mathematically zero)
UNDF as a nan
NA also as a nan
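
You can see these concrete float values directly (a quick check; note that both UNDF and NA display as a plain nan):

In [1]: import gamspy as gp

In [2]: gp.SpecialValues.EPS, gp.SpecialValues.UNDEF, gp.SpecialValues.NA
Out[2]: (-0.0, nan, nan)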

There are many, many nans available to use for UNDF and NA… so we specifically use:

UNDF is float("nan"), which has a byte pattern of:

In [1]: import struct

In [2]: struct.pack('>d', float("nan")).hex()
Out[2]: '7ff8000000000000'

np.nan also has the same byte pattern:

In [1]: struct.pack('>d', np.nan).hex()
Out[1]: '7ff8000000000000'

np.nan is therefore only interpreted as UNDF, which is why countNA finds nothing in your p1.

NA is represented as a nan with a byte pattern of fffffffffffffffe:

In [1]: struct.unpack(">d", bytes.fromhex("fffffffffffffffe"))[0]
Out[1]: nan
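
And you can check that gp.SpecialValues.NA carries exactly this bit pattern (assuming it is exposed as a plain Python float):

In [1]: import struct, gamspy as gp

In [2]: struct.pack(">d", gp.SpecialValues.NA).hex()
Out[2]: 'fffffffffffffffe'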

The logic of choosing float("nan")/np.nan for UNDF rather than NA follows other function returns, like:

In [1]: np.sqrt(-1)
<ipython-input-1-597592b72a04>:1: RuntimeWarning: invalid value encountered in sqrt
  np.sqrt(-1)
Out[1]: np.float64(nan)

np.float64("nan") also has the same byte pattern as np.nan and float("nan").

A “special” nan is used for NA, which in GAMS means “initialized, but no numerical value assigned”, a.k.a. “missing”.

Hopefully that helps untangle the nan behavior you are seeing.

Now on to the drop* methods.

dropUndef really means “drop all nans that are GAMS UNDF special values.”
dropNA really means “drop all nans that are GAMS NA special values.”
dropMissing really means “drop all nans.”

The “missing” naming follows the pandas behavior of dropna… but you can see the obvious naming clash with the pandas method, so we adopted the (hopefully clearer) dropMissing naming convention for a native GAMSPy method that simply gets rid of all nans without relying on native pandas functionality (although mixing and matching pandas and GAMSPy methods is very common and powerful).
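
To make the distinction concrete, here is a minimal sketch (reusing the countNA and drop* methods discussed above) of how the three methods treat a mix of nans:

import gamspy as gp
import numpy as np

ct = gp.Container()
S = ct.addSet("S", records=["a", "b", "c"])

# "a" gets a GAMS NA, "b" a GAMS UNDF (a plain np.nan), "c" a regular value
p = ct.addParameter("p", domain=S, records=np.array([[gp.SpecialValues.NA, np.nan, 1]]))

print(p.countNA())  # 1 -- only the NA-patterned nan counts

p.dropNA()  # removes only the "a" record
# p.dropUndef() would remove only the UNDF record ("b"),
# while p.dropMissing() would remove every nan record at once.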

You also state:

My use case is the following: I have a parameter based on which I want to define variable limits. This parameter is not defined for all set elements, and naturally I do not want to set any limit in the undefined cases. However, since GAMS assumes Parameters to be zero where they are not defined, it would simply set the limit to zero.

My suggestion is to simply define the sparse data for the parameter rather than using a numpy array. At this time, numpy arrays are assumed to be dense data structures, which means that you must define all values for all domain tuples. This might get relaxed in a future release.

Something like this:

p1 = ct.addParameter("p1", domain=S, records=[["b", 1]])

Then you could set your variable bounds:

v.lo[S].where[p1[S]] = p1[S]
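
For completeness, a runnable version of that suggestion might look like this (the variable name v is just illustrative):

import gamspy as gp

ct = gp.Container()
S = ct.addSet("S", records=["a", "b"])

# sparse records: only "b" carries a value; "a" is simply absent
p1 = ct.addParameter("p1", domain=S, records=[["b", 1]])

v = ct.addVariable("v", domain=S)
v.lo[S].where[p1[S]] = p1[S]  # sets a lower bound only where p1 has a (nonzero) record

print(v.records)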

hope this is helpful,
adam


Hi Adam,

thank you, this was quite helpful. But what does p[S] evaluate to in a where statement? In the GAMSPy user guide, right under “Conditions and Assignments” (which I somehow missed before this post), it says that p[S] in a where statement indicates the existence of the parameter. However, here is an example where this is not true:

import gamspy as gp
import numpy as np

ct = gp.Container()
S = ct.addSet("S", records=["a", "b"])

p = ct.addParameter("p", domain=S, records=[["a", 1], ["b", 0]])

x = ct.addVariable("x", domain=S)
x.fx[S].where[p[S]] = p[S]

print(x.records)

Output:

   S  level  marginal  lower  upper  scale
0  a    1.0       0.0    1.0    1.0    1.0

The variable x["b"] should be fixed to zero, which does not happen here. (Of course, setting p["b"] = 0 is the same as not setting it at all, since GAMS assumes a Parameter to be zero where it is not set.) So I assume p[S] in a where statement evaluates to p[S] != 0?
In that case, I must use the following (note that I explicitly added an np.nan value for “c”), as I proposed in my initial post:

import gamspy as gp
import numpy as np

ct = gp.Container()
S = ct.addSet("S", records=["a", "b", "c"])

p = ct.addParameter("p", domain=S, records=[["a", 1], ["b", 0], ["c", np.nan]])

x = ct.addVariable("x", domain=S)
x.fx[S].where[p[S] != gp.SpecialValues.UNDEF] = p[S]

print(x.records)

Output:

   S  level  marginal  lower  upper  scale
0  a    1.0       0.0    1.0    1.0    1.0
1  b    0.0       0.0    0.0    0.0    1.0

Cheers
Gereon

Hi Gereon,

This is a good example of how GAMS does not store zeros. If I convert this to pure GAMS the code would look like this:

set s / a,b/;
parameter p(s) / a 1, b 0 /;

variable x(s);

x.fx(s)$p(s) = p(s);

display p;
display x.lo;
display x.up;

This code will run but displaying the parameter p and the bounds on x shows:

----      8 PARAMETER p  

a 1.000


----     10 VARIABLE x.Lo (-INF) 

a 1.000


----     11 VARIABLE x.Up (+INF) 

a 1.000

So the element p(b) does not even exist.

If you want something mathematically equivalent to setting the value to zero, you will need to use an EPS.

parameter p(s) / a 1, b EPS /;

Then it follows that:

----      8 PARAMETER p  

a 1.000,    b   EPS


----      9 VARIABLE x.Lo (-INF) 

a 1.000,    b   EPS


----     10 VARIABLE x.Up (+INF) 

a 1.000,    b   EPS

It’s a bit of a mental shift, but if you are fixing a variable to zero, GAMS best practice would probably be to not include the variable in the model at all.
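
On the GAMSPy side, the same EPS trick might look like this (a sketch; gp.SpecialValues.EPS is the -0.0 representation described earlier):

import gamspy as gp

ct = gp.Container()
S = ct.addSet("S", records=["a", "b"])

# EPS (stored as -0.0) keeps the "b" record alive even though it is mathematically zero
p = ct.addParameter("p", domain=S, records=[["a", 1], ["b", gp.SpecialValues.EPS]])

x = ct.addVariable("x", domain=S)
x.fx[S].where[p[S]] = p[S]  # the condition now holds for "b" as well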

Hi Adam,

not including the variable in the model does not work for me, because the model is not static. In some cases the variable might be fixed, in others it might not, depending on the input data. The set S might change as well.
Sure, the “GAMSonic” way could be to create an additional subset of S for the variables that are to be fixed, or to simply change the set S and add the fixed variables directly via the parameter. However, I have many such variables, and this seems bureaucratic. I’d rather just fix a variable if the parameter contains data for it, and it ought to be simple to do so.

Another point: when assigning np.nan values to a parameter, they are apparently parsed as gp.SpecialValues.UNDEF. Since I cannot put gp.SpecialValues.NA in a field of an np.array or a pd.Series (as you said, it is represented as np.nan), it is not possible for an np.array or a pd.Series to contain values that are parsed as gp.SpecialValues.NA. However, if they represent missing values rather than the result of an undefined operation, that is exactly what they should be parsed as. Still, this is something I can work with.

I conclude that for my purposes I have to model it as I proposed in my first post: explicitly include missing values as np.nan in the input data (thereby giving up the advantage of a sparse representation), and then only fix those variables whose values are not equal to gp.SpecialValues.UNDEF. The example would then look like this:

import gamspy as gp
import numpy as np

ct = gp.Container()
S = ct.addSet("S", records=["a", "b", "c"])

p = ct.addParameter(
    "p",
    domain=S,
    records=np.array([[1, 0, np.nan]])
)

x = ct.addVariable("x", domain=S)
x.fx[S].where[p[S] != gp.SpecialValues.UNDEF] = p[S]

print(x.records)

Output:

   S  level  marginal  lower  upper  scale
0  a    1.0       0.0    1.0    1.0    1.0
1  b    0.0       0.0    0.0    0.0    1.0

Cheers
Gereon

This is not right. You can have an np.array that contains gp.SpecialValues.NA… it is mapped to a nan, but to a special nan. np.nan is also a nan, but it has a different bit pattern. This also works with other objects like a pd.Series, etc.

You can see this in the following example:

In [3]: arr = np.array([1, 0, np.nan, gp.SpecialValues.UNDEF, gp.SpecialValues.NA])

In [4]: gp.SpecialValues.isUndef(arr)
Out[4]: array([False, False,  True,  True, False])

In [5]: gp.SpecialValues.isNA(arr)
Out[5]: array([False, False, False, False,  True])

Alright, this seems to work. In that case, the error I found must happen during the mapping of NaN values to gp.SpecialValues.NA. See this example:

import gamspy as gp
import numpy as np
import pandas as pd

s = pd.Series(data=[gp.SpecialValues.NA, np.nan])
print(gp.SpecialValues.isNA(s))

s = s.fillna(gp.SpecialValues.NA)
print(gp.SpecialValues.isNA(s))

Output:

[ True False]
[False False]

The first output is as expected. For the second, I would have expected [True True], but apparently the mapping does not work and even remaps the existing “special” nan to np.nan. Or am I missing something?

In practice, of course I would have to map the np.nan values to gp.SpecialValues.NA, so the problem I was describing in the previous post still holds.

You are using the pandas method fillna to do some work, but we cannot guarantee that pandas follows the same nan rules that we do. In fact, we go through some trouble to maintain special nan values in our APIs.

Indeed, it seems tricky to get pandas to remap nan values with a boolean mask:

In [1]: s = pd.Series(data=[1, 2, np.nan])

In [2]: s[gp.SpecialValues.isUndef(s)] = gp.SpecialValues.NA

In [3]: gp.SpecialValues.isNA(s)
Out[3]: array([False, False, False])

However, it appears that numpy is able to do this remapping and maintain the nan bit patterns:

In [1]: arr = np.array([1, 2, np.nan])

In [2]: arr
Out[2]: array([ 1.,  2., nan])

In [3]: arr[gp.SpecialValues.isUndef(arr)] = gp.SpecialValues.NA

In [4]: gp.SpecialValues.isNA(arr)
Out[4]: array([False, False,  True])
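
So one workable pattern, given the numpy behavior above, would be to remap the nans on the numpy side before handing the data to the Container (a sketch, not an official recipe):

import gamspy as gp
import numpy as np

ct = gp.Container()
S = ct.addSet("S", records=["a", "b", "c"])

arr = np.array([1, 0, np.nan])  # raw data where np.nan means "missing"
arr[gp.SpecialValues.isUndef(arr)] = gp.SpecialValues.NA  # remap to the NA bit pattern

p = ct.addParameter("p", domain=S, records=np.array([arr]))
print(p.countNA())  # 1 -- the remapped value is now recognized as NA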