File size problem after harpmerge

I am trying to merge two NetCDF files with harpmerge in Python, using two different methods. Each NetCDF file is 127 MB. It is important for me to clip the NetCDF dataset to my study area.

First method:

I merged these two files with HARP in Python:

import os
import harp

myfiles = '…/*.nc'
i = 0
file_names = []
products = []  # list of products to merge

for filename in os.listdir(os.path.dirname(os.path.abspath(myfiles))):
    base_file, ext = os.path.splitext(filename)

    if ext == ".nc" and base_file.split('_')[0] == 'S5P':
        product_name = base_file + "_" + str(i)

        try:
            # Clip to the study area, then regrid onto a 0.01 degree lat/lon grid
            product_name = harp.import_product(base_file + ext,
                                               operations="latitude > 24.8 [degree_north]; latitude < 39.9 [degree_north];\
                                               longitude > 43.8 [degree_east]; longitude < 63.4 [degree_east];\
                                               bin_spatial(1511,24.8,0.01,1961,43.8,0.01); derive(longitude {longitude});\
                                               derive(latitude {latitude})")
            print("Product " + base_file + ext + " imported")

            products.append(product_name)
        except Exception as e:
            print("Product not imported:", e)

        i = i + 1

# Average the imported products over time and drop the time dimension from the axes
product_bin = harp.execute_operations(products, post_operations="bin();squash(time, (latitude,longitude))")

harp.export_product(product_bin, "merged.nc")

The merged NetCDF file is 296 MB. That is very large: if I use this method for many files, it will need a huge amount of disk space.

Second method:

In this method I tried to use xarray. To do that, I first clip each file to my study area:

ds = xr.open_dataset(file, group='PRODUCT')
ds_ir = ds.where((44 < ds.longitude) & (ds.longitude < 65)
                 & (24 < ds.latitude) & (ds.latitude < 41), drop=True)

src_fname, ext = os.path.splitext(file)  # split filename and extension
save_fname = os.path.join(outpath, os.path.basename(src_fname) + '.nc')
ds_ir.to_netcdf(save_fname)

Each NetCDF file clipped with xarray is 3.5 MB (a good size).
But when I try to merge these two clipped NetCDF files, the first step, importing the product, fails:

product_name = harp.import_product('…/clipped_nc_in xarray.nc')

It returns:

CLibraryError: /clipped_nc_in xarray.nc: unsupported product

What is my problem? Why is a NetCDF file created with xarray not supported by HARP?

Regarding the size difference, be aware that with xarray you are only reading the variables from the 'PRODUCT' netcdf group. HARP will read a lot of other variables as well. If you are interested in only a single variable, for example (together with the time/lat/lon axes), then you should use the keep() operation as part of your HARP operations to limit the number of ingested variables. This will greatly reduce the size of the output product (and speed up the bin_spatial() operation).
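
A minimal sketch of how keep() might be appended to the area filter from the question. The variable name tropospheric_NO2_column_number_density is only an assumption for illustration (the thread does not say which S5P product is used), build_operations is a hypothetical helper, and the exact set of axis variables to keep may differ for your product.

```python
def build_operations(variable, lat_min, lat_max, lon_min, lon_max):
    """Return a HARP operations string: area filter plus keep() for one variable."""
    steps = [
        f"latitude > {lat_min} [degree_north]",
        f"latitude < {lat_max} [degree_north]",
        f"longitude > {lon_min} [degree_east]",
        f"longitude < {lon_max} [degree_east]",
        # Drop everything except the time/lat/lon axes, the pixel bounds
        # and the one variable of interest.
        "keep(datetime_start, latitude, longitude, "
        f"latitude_bounds, longitude_bounds, {variable})",
    ]
    return ";".join(steps)


# ASSUMPTION: the variable name below is illustrative, not taken from the thread.
ops = build_operations("tropospheric_NO2_column_number_density",
                       24.8, 39.9, 43.8, 63.4)
print(ops)

# With the harp package installed, the string would be passed like this:
# product = harp.import_product(base_file + ext, operations=ops)
```

Note that keep() only succeeds if every listed variable exists in the ingested product, so the list has to be adapted to the product type.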

In addition, you are using a target grid resolution that is slightly higher than the native S5P resolution. This is not necessarily a bad thing, but you will then of course end up with a higher file size than if you would just filter the original satellite pixels using a lat/lon filter (as you are doing with xarray).
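
To get a feeling for why the gridded output is large, here is a rough back-of-the-envelope sketch. The edge counts come from the bin_spatial() call in the question; the 8 bytes per value and the absence of compression are assumptions, so the result is only an order-of-magnitude estimate.

```python
def grid_cells(lat_edges, lon_edges):
    """Number of grid cells for the given edge counts (cells = edges - 1)."""
    return (lat_edges - 1) * (lon_edges - 1)


# Edge counts taken from bin_spatial(1511,24.8,0.01,1961,43.8,0.01).
cells = grid_cells(1511, 1961)

# ASSUMPTION: every gridded variable is stored as 8-byte values, uncompressed.
bytes_per_variable = cells * 8

print(cells)                                       # 2959600
print(f"{bytes_per_variable / 1e6:.1f} MB per gridded variable")
```

Under these assumptions each gridded variable costs a little over 20 MB, so a dozen ingested variables already land in the range of the 296 MB file.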

Regarding reading of data saved by xarray in HARP, note that the file that xarray writes is neither a valid S5P product, nor is it a valid HARP product. This means that HARP will not be able to import it.

Hi @svniemeijer, thanks for your complete explanation. What is the native S5P resolution? I could not find it in decimal degrees.
I clipped the original (downloaded) S5P files in xarray and then tried to merge them in HARP. If I want to merge the original (downloaded) S5P files (without clipping) in HARP, what are the correct operations to use with the native S5P resolution?

The pixel resolution of S5P is about 3.5km x 5km. It is not defined in decimal degrees (since the length of a longitudinal degree depends on the latitude).
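
As an illustration of why a pixel size in km has no single value in decimal degrees, the sketch below converts lengths to degrees on a spherical Earth. The mean radius of 6371 km and the helper names are assumptions made for this example; the two latitudes are the bounds of the study area from the question.

```python
import math

EARTH_RADIUS_KM = 6371.0  # ASSUMPTION: spherical Earth with mean radius
KM_PER_DEG = 2 * math.pi * EARTH_RADIUS_KM / 360.0  # length of one degree of latitude


def km_to_lat_degrees(km):
    """Length in km expressed in degrees of latitude (same at every latitude)."""
    return km / KM_PER_DEG


def km_to_lon_degrees(km, latitude_deg):
    """Length in km expressed in degrees of longitude at the given latitude."""
    return km / (KM_PER_DEG * math.cos(math.radians(latitude_deg)))


print("3.5 km in latitude degrees:", round(km_to_lat_degrees(3.5), 4))
for lat in (24.8, 39.9):
    print(f"5 km in longitude degrees at {lat} N:",
          round(km_to_lon_degrees(5.0, lat), 4))

# For comparison, the size in km of one 0.01 degree grid step in latitude:
print("0.01 degree of latitude in km:", round(0.01 * KM_PER_DEG, 3))
```

The same 5 km corresponds to more longitude degrees at 39.9 N than at 24.8 N, which is why the native resolution is quoted in km.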

If you want to keep the data in the original satellite grid resolution, then don’t perform the bin_spatial() operation.

Without bin_spatial, the longitude and latitude dimensions are removed from the output nc file.
I tried this code:
product_name = harp.import_product( base_file + ext, operations="latitude > 24.8 [degree_north]; latitude < 39.9 [degree_north]; longitude > 43.8 [degree_east]; longitude < 63.4 [degree_east]; derive(longitude {longitude}); derive(latitude {latitude})")

OR

product_name = harp.import_product( base_file + ext, operations="latitude > 24.8 [degree_north]; latitude < 39.9 [degree_north]; longitude > 43.8 [degree_east]; longitude < 63.4 [degree_east]")

OR

product_name = harp.import_product( base_file + ext)

What is my problem?

This is because the S5P data does not have a latitude and longitude dimension (it has latitude and longitude variables). The dimensions in the S5P product are scanline and groundpixel. But those are more virtual raster dimensions (since the S5P pixels actually have a small overlap, they don’t form an actual grid). So HARP flattens all pixels into a single time dimension when reading the data.

Your xarray data also does not have latitude and longitude dimensions.
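
The flattening described above can be pictured with a toy example in plain Python. This is only an illustration of the idea, not HARP's actual implementation; the 2 x 3 swath and its values are made up.

```python
# A toy 2 x 3 swath: rows are scanlines, columns are ground pixels.
# latitude and longitude are variables on that raster, not axes of it.
latitude = [[24.0, 24.1, 24.2],
            [25.0, 25.1, 25.2]]
longitude = [[43.0, 44.0, 45.0],
             [43.1, 44.1, 45.1]]


def flatten(grid):
    """Turn a scanline x ground_pixel raster into one list of samples."""
    return [value for row in grid for value in row]


lat_flat = flatten(latitude)
lon_flat = flatten(longitude)

# An area filter then simply selects samples along the single flattened dimension.
selected = [(la, lo) for la, lo in zip(lat_flat, lon_flat)
            if la > 24.8 and lo > 43.8]

print(len(lat_flat))  # 6
print(selected)       # [(25.1, 44.1), (25.2, 45.1)]
```

After filtering, what remains is a list of pixels that each carry their own latitude and longitude, so there is no regular lat/lon grid unless the data is rebinned onto one.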

Thanks for your clear explanation. Based on your comment, with the code above, do I have latitude and longitude variables in the output nc file? If not, how can I get them?
When I open the output nc file with xarray, I have only one number each for latitude and longitude, as follows:

Without the bin_spatial operation, the output NetCDF file is 14.2 KB and has lost longitude and latitude!