Improving/automating data input: Excel -> CSV -> GDX -> GAMS

Hi again, GAMS fanatics.

I posted a while ago about data input. I managed to solve some of my issues and I have a working process. First, I want to share it here with you. Second, I would like your advice to make it faster.

Context
I run a work-in-progress (and therefore changing all the time) model that requires multiple sets and parameters, sometimes with thousands of data each. I wanted a system that would allow me to add/remove/rename sets and parameters in the model, and then easily go back to the data, then re-import.

I checked GAMS Connect, but it seems to require constant keep-up and attention to naming declarations, locations, sizes of sets/parameters. I thus looked for a way to import any set/parameter, regardless of the size.

Data structure

  1. I download the original datasets from the sources.
  2. I connect all the datasets to a master data model in Excel, through Power Query and Power Pivot
  3. I recompose the data in the way I need for each set/parameter. Since all the datasets are connected, it is easy to make structural changes on the fly.
  4. I have two other master files. One for sets (Model/Data/Sets/Sets.xlsm). Another for parameters (Model/Data/Parameters/Parameters.xlsm)
  5. Once I am happy with the data model’s results, I copy-paste the data in the sets or parameters data files.
  6. In these sets and parameters files, I have different tabs for different parts of the model. Each tab has a combination of Excel tables (not ranges) and Pivot Tables. I add the name of the set/parameter on top of the table.
  7. I built a macro in each of the Excel files to go through all the pivot tables and tables, then extract them as CSV files in Model/Data/Sets/SetCSV and Model/Data/Parameters/ParametersCSV
  8. Once I am happy with the data for my model, I press a button and the macros create the CSV files. Now we have two folders full of CSVs for each set or parameter.

Modelling

  1. I define all my sets in Model/parameters/sets.set
  2. I run Data/LoadData.gms with this code this is slow:
* Declaring the sets
$include Model/parameters/sets.set

* Declaring the paths 
$setLocal Set_CSV Data/Sets/SetCSV/
$setLocal Set_GDX Data/Sets/SetGDX

* Embedded Batch File: CSV-to-GDX Conversion for sets
$onEcho > Data\Sets\set_csv2gdx.bat
@echo off
setlocal enabledelayedexpansion

rem -- Jump to the folder where this script lives (Data/Sets)
cd /d %~dp0

rem -- Define paths relative to this script
set "CSV_PATH=%~dp0SetCSV"
set "GDX_PATH=%~dp0SetGDX"

echo Converting CSV files from: !CSV_PATH!
echo Output GDX files to: !GDX_PATH!

rem -- Delete any existing GDX files in the destination folder
del /f /q "!GDX_PATH!\*.*" 2>nul
if exist "!GDX_PATH!\sets.gdx" del /f /q "!GDX_PATH!\sets.gdx"

rem -- Create a temporary index file in the working directory
echo #FileName,HeaderCount,DataCount,SkipExtra > csv_index.txt

for %%f in ("!CSV_PATH!\*.csv") do (
    set "filename=%%~nf"

    rem –– count total lines (including header)
    for /f %%L in ('find /v /c "" ^< "%%f"') do set totalLines=%%L

    rem –– read header & data counts as before…
    set "header="
    set /p header=<"%%f"
    for /f "tokens=1 delims=," %%H in ("!header!") do set "firstHeader=%%H"
    if /I "!firstHeader!"=="!filename!" (set "skipExtra=1") else (set "skipExtra=0")
    call :CountCommas "!header!"   & set /a headerCount=!COMMA_COUNT!+1

    set "dataLine="
    for /f "usebackq skip=1 delims=" %%D in ("%%f") do if not defined dataLine set "dataLine=%%D"
    call :CountCommas "!dataLine!" & set /a dataCount=!COMMA_COUNT!+1

    rem –– now compute the Excel range using totalLines
    if "!skipExtra!"=="1" (
        set /a effdims=dataCount - 1
        call :NumToCol !dataCount!
        set "range=B2:!col!!totalLines!"
    ) else (
        set /a effdims=dataCount
        call :NumToCol !dataCount!
        set "range=A2:!col!!totalLines!"
    )

    echo Importing !filename!.csv as a set with !effdims! dimensions using range !range!...
    gdxxrw "!CSV_PATH!\!filename!.csv" ^
           output="!GDX_PATH!\!filename!.gdx" ^
           set=!filename! rng=!range! rdim=!effdims! maxdupeerrors=999999
)


rem -- Move the temporary index file into the GDX folder (cleanup)
move /Y csv_index.txt "!GDX_PATH!\"

endlocal
exit /b

:CountCommas
setlocal enabledelayedexpansion
set "line=%~1"
set "count=0"
:CCLoop
if "!line!"=="" goto CCDone
set "firstChar=!line:~0,1!"
set "line=!line:~1!"
if "!firstChar!"=="," set /a count+=1
goto CCLoop
:CCDone
endlocal & set COMMA_COUNT=%count%
exit /b

:NumToCol
setlocal enabledelayedexpansion
set "alphabet=ABCDEFGHIJKLMNOPQRSTUVWXYZ"
set /a idx=%1 - 1
set "col=!alphabet:~%idx%,1!"
endlocal & set col=%col%
exit /b
$offEcho


* Embedded Batch File: GDX Loader for sets
$onEcho > Data\Sets\set_gdxload.bat
@echo off
setlocal enabledelayedexpansion

rem -- Define the folder where GDX files are stored using the GAMS macro.
set "GDX_PATH=%cd%\%Set_GDX%"

rem -- Delete any existing load_sets.inc file in the GDX folder.
if exist "!GDX_PATH!\load_sets.inc" del /f /q "!GDX_PATH!\load_sets.inc"

rem -- Generate the include file from the parameter list (assumed to be in Data\Input\parameterslist.txt).
(
    for /f "usebackq delims=" %%L in ("Data\Sets\setslist.txt") do (
         echo $gdxin "%Set_GDX%/%%L.gdx"
         echo $load %%L
         echo $gdxin
    )
) > "!GDX_PATH!\load_sets.inc"

rem Optionally, move the sets list file into the GDX folder:
rem move /Y "Data\Sets\setslist.txt" "!GDX_PATH!\"

endlocal
exit /b
$offEcho

* Calling the batch files
$if %import_sets% == Y
$call Data\Sets\set_csv2gdx.bat
$call Data\Sets\set_gdxload.bat

* Now include the generated load file so that all individual GDX sets are loaded.
$include "%Set_GDX%/load_sets.inc"

* Declaring the parameters
$include Model/parameters/parameters.par

* Declaring the paths 
$setLocal Par_CSV Data/Parameters/ParametersCSV/
$setLocal Par_GDX Data/Parameters/ParametersGDX

* Embedded Batch File: CSV-to-GDX Conversion for parameters
$onEcho > Data/Parameters/par_csv2gdx.bat
@echo off
setlocal enabledelayedexpansion

rem -- Jump to the folder where this script lives (Data/Parameters)
cd /d %~dp0

rem -- Define paths relative to this script
set "CSV_PATH=%~dp0ParametersCSV"
set "GDX_PATH=%~dp0ParametersGDX"

echo Converting parameter CSV files from: !CSV_PATH!
echo Output GDX files to:       !GDX_PATH!

rem -- Delete any existing GDX files in the destination folder
del /f /q "!GDX_PATH!\*.*" 2>nul

rem -- Create a temporary index file with headers
echo FileName,HeaderCount,DataCount,SkipExtra,TotalLines>csv_index.txt

rem -- 1) Build the index and capture TotalLines
for %%F in ("!CSV_PATH!\*.csv") do (
    set "filename=%%~nF"

    rem -- Count total lines (including header)
    for /f %%L in ('find /v /c "" ^< "%%F"') do set "totalLines=%%L"

    rem -- Read header and detect redundant first column
    set /p "header="<"%%F"
    for /f "tokens=1 delims=," %%H in ("!header!") do set "firstHeader=%%H"
    if /I "!firstHeader!"=="!filename!" (set "skipExtra=1") else (set "skipExtra=0")

    rem -- Count commas in header → headerCount
    call :CountCommas "!header!"
    set /a headerCount=COMMA_COUNT+1

    rem -- Grab first data line and count commas → dataCount
    set "dataLine="
    for /f "usebackq skip=1 delims=" %%L in ("%%F") do if not defined dataLine set "dataLine=%%L"
    call :CountCommas "!dataLine!"
    set /a dataCount=COMMA_COUNT+1

    rem -- Append everything (including totalLines) to index
    echo !filename!,!headerCount!,!dataCount!,!skipExtra!,!totalLines!>>csv_index.txt
)

rem -- 2) Convert each CSV using the index
for /f "skip=1 tokens=1-5 delims=," %%a in (csv_index.txt) do (
    set "fname=%%a"
    set "headerCount=%%b"
    set "dataCount=%%c"
    set "skipExtra=%%d"
    set "rowCount=%%e"

    rem -- Trim any stray spaces
    set "skipExtra=!skipExtra: =!"
    set "rowCount=!rowCount: =!"

    if "!skipExtra!"=="1" (
        rem -- drop redundant first+value columns
        set /a effdims=dataCount-2, colIndex=dataCount-1
        call :NumToCol !colIndex!
        set "range=B2:!col!!rowCount!"
    ) else (
        rem -- use A through (dataCount-1)
        set /a effdims=dataCount-1, colIndex=dataCount-1
        call :NumToCol !colIndex!
        set "range=A2:!col!!rowCount!"
    )

    echo Importing !fname!.csv as a parameter with !effdims! dimensions using range !range!...
    gdxxrw "!CSV_PATH!\!fname!.csv" ^
           output="!GDX_PATH!\!fname!.gdx" ^
           par=!fname! rng=!range! rdim=!effdims! maxdupeerrors=999999
)

rem -- Move the index file for record-keeping
move /Y csv_index.txt "!GDX_PATH!\"

endlocal
exit /b

:::::::::::::::::::::::::::::::::::::::::
:CountCommas
  setlocal enabledelayedexpansion
  set "line=%~1"
  set "count=0"
:CCLoop
  if "!line!"=="" goto CCDone
  if "!line:~0,1!"=="," set /a count+=1
  set "line=!line:~1!"
  goto CCLoop
:CCDone
  endlocal & set COMMA_COUNT=%count%
  exit /b

:::::::::::::::::::::::::::::::::::::::::
:NumToCol
  setlocal enabledelayedexpansion
  set "alphabet=ABCDEFGHIJKLMNOPQRSTUVWXYZ"
  set /a idx=%1 - 1
  set "col=!alphabet:~%idx%,1!"
  endlocal & set col=%col%
  exit /b

$offEcho

* Embedded Batch File: GDX Loader for parameters
$onEcho > Data/Parameters/par_gdxload.bat
@echo off
setlocal enabledelayedexpansion

rem -- Define the folder where GDX files are stored using the GAMS macro.
set "GDX_PATH=%cd%\%Par_GDX%"

rem -- Delete any existing load_params.inc file in the GDX folder.
if exist "!GDX_PATH!\load_params.inc" del /f /q "!GDX_PATH!\load_params.inc"

rem -- Generate the include file from the parameter list (assumed to be in Data\Input\parameterslist.txt).
(
    for /f "usebackq delims=" %%L in ("Data\Parameters\parameterslist.txt") do (
         echo $gdxin "%Par_GDX%/%%L.gdx"
         echo $load %%L
         echo $gdxin
    )
) > "!GDX_PATH!\load_params.inc"

rem Optionally, move the parameters list file into the GDX folder:
rem move /Y "Data\Parameters\parameterslist.txt" "!GDX_PATH!\"

endlocal
exit /b
$offEcho

* Calling the batch files
$if %import_parameters% == Y
$call Data\Parameters\par_csv2gdx.bat
$call Data\Parameters\par_gdxload.bat

* Now include the generated load file so that all individual GDX sets are loaded.
$include "%Par_GDX%/load_params.inc"

This creates 2 batch files for sets, and another 2 for parameters.

The first batch file (set_csv2gdx.bat and par_csv2gdx.bat):

  • converts the CSV files I just created to individual GDX files with the same name

e.g. my_parameter.csv → my_parameter.gdx

  • puts them in the folder Data/Sets/SetGDX or Data/Parameters/ParameterGDX
  • Creates a txt file with the full list of sets and parameters (setslist.txt and parameterslist.txt)
my_parameter
my_other_parameter
...

The second batch file (set_gdxload.bat and par_gdxload.bat):

  • Creates an include file load_sets.inc and load_params.inc that looks like this:
$gdxin "Data/Parameters/ParametersGDX/my_parameter.gdx"
$load my_parameter
$gdxin
$gdxin "Data/Parameters/ParametersGDX/my_other_parameter.gdx"
$load my_other_parameter
$gdxin
... 
  1. I run the model with tagged flags when I want to import sets and parameters or to skip them.

Thoughts
This is slow because I see that GAMS is running individual instances of conversion from CSV to GDX for each of the sets and parameters, taking around 1-3 seconds each! This isn’t a problem if I run the code and forget about it but sometimes I need quick runs and then this becomes a few minutes of waiting JUST to import the datasets.

The good part of the code is that it automatically does everything I need. The bat files are automatically detecting the CSVs in the determined folders, determining how many rows of data they have and how many dimensions so I don’t need to code that elsewhere since it is in the loop of the bat file.

I tried code to run the CSV to GDX conversion in parallel many times without success, thinking that maybe the computer can use the cores to process the conversion in parallel rather than doing one at a time.

I think that at this point I am extremely overcomplicating things. However, bottomline, is that this code does what I need beautifully, but sloooowly.

I am neither a GAMS expert nor fanatic, but as a regular user of GAMS I was curious about the reported slowness. Thus far I have not been too much needing to import CSV files into GAMS, but it is becoming more regular for me as well. There seems to be quite a few options for doing it:

  1. Direct reading at compile time,
  2. Using the csv2gdx.exe tool, either at compile or execution time
  3. Using the CSVReader Connect agent
  4. Using the csvread GAMS tool either at compile or execution time
  5. Using the gdxxrw.exe tool either at compile or execution time

As in your example, you seem to be creating and loading the GDX files at compile time (with domain sets pre-defined), I tested the performance of the first and the second options at compile time. I prepared 100 CSV files, each for a corresponding multi-dimensional GAMS parameter (named for test purpose just par001.csv,…,par100.csv), with their domain sets predefined. Each of the files had 5000 data rows, and thus 5000 parameter instances. All together, the files thus had 500,000 data lines and parameter instances.

Direct reading: I was happy to see that the direct reading of these all files into the parameters took just 0.2 seconds, and so I could not see any slowness there.
Using the csv2gdx.exe: This method took significantly longer, about 3.1 seconds on my computer. But I think that is kind of understandable, due to the overhead of writing 100 GDX files with 100 calls for a separate executable (causing also some anti-malware software monitoring overhead), before loading the parameters then into GAMS.

Anyway, I was glad to see that neither of these two methods did not manifest any truly unexpected slowness. I suspect that the use of complex batch files and gdxxrw may be causing the slowness in your case. Admittedly, I just imported all the data records in each file, assuming each CSV has exactly the columns matching the corresponding parameter indexes (plus the value column), which may also explain the better performance.