How is the average of a HARP "bin" calculated? Role of weight and count variables?

I understood that since HARP 1.9 or so the HARP bin operation is no longer an arithmetic average but a weighted average.
Suppose for example a “point_distance(lat,lon,50 [km]);bin();” operations on satellite orbit files, where the satellite pixel dimensions are present (i.e., there is latitude_bounds and longitude_bounds)

The result is a weighted average. But how is it exactly calculated? By weighting according to the pixel area?

In the resulting file there are the new “weight” and “count” variables.

“weight” will be used in subsequent bin operations, correct?
Does “count” play a role in subsequent bin operations? Perhaps when the weight variable is excluded before?

For a limited number of variables, there is now a corresponding “_weight” variable:
latitude_weight, longitude_weight, solar_zenith_angle_weight, solar_azimuth_angle_weight, sensor_zenith_angle_weight, sensor_azimuth_angle_weight

What is the role of these new _weight variables? Why only for these variables?

The best documentation on this is, is in the documentation for the C library functions for harp_product_bin_spatial and harp_product_bin.

We know that this is something we need to document better.

Let me know if you have any additional questions after reading those links.

Thanks. As I use the “bin” operation, I guess “harp_product_bin” (and not harp_product_bin_spatial) is the relevant C function.
The explanation for harp_product_bin_spatial seems a bit more complete than for harp_product_bin.

“In addition, a ‘weight’ variable will be added that will contain the sum of weights for the contribution to each cell. If a variable contained NaN values then a variable specific weight variable will be created with only the sum of weights for the non-NaN entries.”

-> I guess the same holds for harp_product_bin, but replacing “to each cell” by “to each bin” ??

In harp_product_bin_spatial, area binning means that "each sample will be allocated to each lat/lon grid cell based on the amount of overlap. "

-> For harp_product_bin, there are no grid cells and so no overlap can be calculated. So I guess weighting uses surface area, not amount of overlap?

harp_product_bin_spatial actually does both temporal binning and spatial binning (although in practice you would mostly use it with a single temporal target bin).
harp_product_bin only does the temporal binning.

The documentation is pretty accurate. There is no ‘spatial overlap’ in the temporal binning case (harp_product_bin). Weight variables are only created for angular variables (to store the length of the combined unit vectors).

Does this mean that in the above example (bin() per orbit file), an arithmetic average of the individual pixels will be taken?
And would, in a subsequent bin() operation (combining several “binned-per-overpass” instances), the average be weighted using the “count” variable?

That is indeed how it will happen.

I don’t observe that both temporal and spatial binning is done when the bin_spatial operation is applied. The example I have is:

harpmerge -f hdf5 -a “bin_spatial(360,-90,0.5,720,-180,0.5);derive(latitude{latitude});derive(longitude{longitude});exclude(angle)” -ap “squash(time, (latitude,longitude));”

this gives an file with two bins in the time dimension, to average the temporal binning, I need to do a bin() operation. So, I dont understand what you mean by " harp_product_bin_spatial actually does both temporal binning and spatial binning" ?

You have to dissect what is happening.

First of all, you are calling harpmerge. This merges the result of operations done on multiple products. In your case 2 products. Per product you are performing a bin_spatial operation. So your bin_spatial does not work on the products combined. The final bin() that you want to do (but are not showing in your example) is because you want to flatten the concatenation that the merge is doing (i.e. you want to unbin, what got binned per product).

The bin_spatial that you perform on a single product, is, in fact, binning your product in time: it is putting everything in a single bin that has start time / stop time equal to that of the product. In other words, your time resolution changes (the time resolution is no longer a scanline, but an orbit).

If you would use the C-library you can actually, when gridding a single product, provide a binning grid in the time dimension when performing a harp_product_bin_spatial. See the num_time_bin and num_time_elements arguments, which behave similar to that of harp_product_bin. You could actually provide the validity (quality flag) variable for this, and then you would get separate grids per quality flag value.

thanks a lot, I think it is clear now. So I can do the following:

harpmerge -f hdf5 -ap “bin_spatial(360,-90,0.5,720,-180,0.5);”

harpmerge -f hdf5 -a “derive(latitude{latitude});derive(longitude{longitude});exclude(angle);squash(ti
me, (latitude,longitude));”

from it is clear that a temporal average (over all scanlines of both S5P files) is done and the end results is a single bin in time dimension. The second command can then perform some additional operations on

How can I get separate grids per quality flag value without using the C interface (maybe this is not possible? ). I see I can do a temporal average per quality flag as follows:

harpmerge -f hdf5 -ap “bin(validity);”

Doing a bin_spatial as an -ap means that you first merge the datasets and then perform the spatial binning. This may require too much memory to keep all the L2 data in memory.

The whole idea of doing bin_spatial as -a and then perform ‘additional operations’ on the merged data is what the -ap is actually for. There should be no need to split things in two commands.

Also, if you are not using multiple products as input, just use harpconvert.

Doing a spatial bin with binning on a temporal variable is indeed not possible as a harp operation. This would make the bin_spatial function too complex (it already has quite a few parameters).

OK, thanks for the clear explanation. I still have some question on the -ar option. So when I do the following:

harpmerge -f hdf5 -a “bin()” will contain two bins in time dimension. (since -a does the temporal binning opeation on each input S5P file)

harpmerge -f hdf5 -ap “bin()” will contain one bin in time dimension. (since -ap does the temporal binning opeation on the merged input S5P files)

But when I try the -ar option to take a temporal average, I get the following error:

harpmerge -f hdf5 -ar “bin()”
ERROR: products don’t both have variable ‘latitude_weight’

I would say should be the same as in the previous command (with -ap). But apparently I still need to do something here.

I found the answer of my own previous question. To apply the bin() operation with harpmerge in combination with the -ar option, the following can be done: (I only keep the HCHO_slant_column_number_density variable here)

harpmerge -f hdf5 -a “keep(HCHO_slant_column_number_density);bin();” -ar “bin();derive(HCHO_slant_column_number_density float {time})”

this is because the reduce operations (after -ar) acts on the merged output file that is created after each append. Therefore this -ar is memory efficient (as explained in harpmerge — HARP 1.14 documentation) .
However I dont understand why I needed to do the change from double to float (derive(HCHO_slant_column_number_density float {time}) ). This causes the result in to be of another datatype than when one should have used the -ap option.

I don’t understand your example. If I execute your command I actually get a ERROR: variables don't have the same datatype (HCHO_slant_column_number_density). So I have to leave out the type conversion of the variable.

Indeed, you 're right, sorry, I ve made some mistake here. The correct command is indeed:
harpmerge -f hdf5 -a “keep(HCHO_slant_column_number_density);bin();” -ar “bin();”