I’m searching a way to account for the number of observations when spatially binning data with harp some random error variables. As you know, those errors are generally considered to be reduced by sqrt(N_obs) when averaging a series of obs.
This is obviously relevant when gridding at coarse resolution, which leads each grid cell to contain several satellite pixels.
Is it the case now and how should I switch this on since my *_uncertainty_random variables are clearly simply binned (averaged) for now?
As an alternative, I could reduce the binned *uncertainty_random variables myself but I would need a count variable to do so. This is not the case for surface area-weighted binning, which only provides a weight variable (1 if the grid cell is entirely filled in by several smaller satellite pixels, less than 1 if part of the cell is not fully covered or even larger than 1 in case several orbits overpasses the grid cell).
So, at the moment I don’t see a variable produced by harp which could help me to do compute the appropriate random error for the binned variables. Or can HARP do this automatically?
According to the release history, Releases · stcorp/harp · GitHub, it was introduced in HARP 1.12, 2020-10-29.
"Added uncertainty propagation to bin() and bin_spatial() operations
(using uncorrelated assumption for bin() and correlated for bin_spatial()). "
So since you are applying here bin_spatial, the default setting might not reduce *_uncertainty_random variables.
You could try to do “set( propagate_uncertainty, uncorrelated)” although it is not clear to me yet if this will make a difference, since uncorrelated is the default setting according to the documentation of the set operation.
The approach in HARP is that bin_spatial will always assume full correlation for the uncertainty propagation. For the bin operation, the default is to assume a fully uncorrelated uncertainty, unless you use set(propagate_uncertainty,correlated). This applies to all variables with the name uncertainty in them.
The exception is variables with the name uncertainty_systematic in them which will always use fully correlated propagation for bin.
For the spatial binning I am even not sure whether you should be able to consider the random uncertainty as fully uncorrelated. If a satellite pixel overlaps with two adjacent grid cells, then, even if these grid cells are much larger than the satellite pixel, you will still be introducing a correlation between the values of these neighbouring grid cells. And that is not even considering that satellite pixels often have some overlap among themselves.
The approach that might be more sensible, if you want to reduce the uncertainty through reducing the spatial resolution, is to use coadding, making sure pixels contribute only to one target area. If you then regrid these co-added areas into a regular grid (using e.g. bin_spatial) as a post processing step, we can use the fully correlated assumption for the propagation again.
Co-adding is not a function that is in HARP, but should be relatively easy to implement yourself as a routine if your satellite pixels form a nice grid (you just have to stitch the latitude/longitude boundary coordinates and can use whatever uncertainty propagation technique you want).
As a side note, for the L3 maps on the S5P-PAL Mapping Portal we create grids per orbit using the (hardcoded) fully correlated assumption, turn on set(propagate_uncertainty,correlated) to bin the grids of multiple orbits of one day into a daily grid. And then average the daily grids into multi-day grids by using the uncorrelated assumption.
In other words, we assume all measurements to be spatially fully correlated when falling within the same day and fully uncorrelated when combining multiple days.
This is again still a heavy approximation, since there is always some level of correlation at different time scales, but I think is a sensible default implementation.
Thanks for the prompt reply. I see and understand your different arguments.
The random variables we want to bin are really random from a satellite observation to another since they only represent the instrumental shot-noise.
So I think this is important for us to find a way to take into account the reduction of those random uncertainties when binning data in a coarse grid. Otherwise the reported errors will be way too high. For example, a cell of 0.2°x0.2° would include roughly 20 S5p pixels leading to a reduction of the random error by a factor 4-5 owing to the binning process.
Note that I agree with you that the satellite pixels in overlap of two grid cells introduce some spatial correlation, but it’s likely marginal compared to the reduction due to the binning.
When binning in the time dimension, your proposed approach totally makes sense. But as we would need also this in the spatial dimension, I’m afraid we’ll need to investigate this co-adding option. An alternative is to work with high resolution grids and then to rebin at coarser resolution using home-made tools. But in this case, we deviate from the harp world, which is a pity.
I am not sure if I agree here (or perhaps I misunderstand the argument). The column measurements will indeed be correlated, but the error on them will be uncorrelated, at least when _uncertainty_random indeed represents truly a fully random error (spatially and temporally).
I think it would make sense to make the uncertainty propagation also for bin_spatial changeable by using ‘set’, as is now the case for ‘bin’. Products like HCHO column have a large noise error, so reduction of _uncertainty_random by sqrt(N) is sensible.
Perhaps here is the solution for Christophe.
(below is adapted from an explanation Sander sent me a few years ago)
First create daily grids using only the orbits of that day as input to a harpmerge. You would call bin_spatial() as normal, but as part of the ‘-ar’ (reduce_operations) you would use ‘set(“propagate_uncertainty”, “uncorrelated”);bin()’
Then you would combine these daily grids into a monthly grid using harpmerge and a regular -ar ‘bin()’ reduce operation.
(Of course the explicit use of set above is not needed since I inserted the default value. It was just to make the intention clear)
In the original explanation Sander used ‘set(“propagate_uncertainty”, “correlated”);bin()’ precisely because he wanted to have correlated error propagation of the daily bin operation. As Christophe now wants uncorrelated error propagation, would the above approach work?
The above doesn’t work, because it would then also imply an uncorrelated propagation of the total uncertainty. And we don’t want that to happen. (note that most of the uncertainty propagation functionality in harp is currently more targeted towards propagating the total uncertainty; proper support for uncertainty propagation and splitting the random/systematic parts is a rather difficult topic).
Also, be aware that correlation does not only come from uncertainties being correlated, but also from the values themselves being correlated (see also the wikipedia article on this).
I do have a relatively easy way forward though if you really want to do the approach you mentioned.
You could create a python script that would:
ingest each product with harp
square all the random uncertainties
call harp.execute_operations to perform the bin_spatial regridding
multiple all random uncertainties by their weight
take the square root of the random uncertainties
and divide the resulting uncertainties again by their weight
And what if in the product there are only uncertainty_systematic and uncertainty_random variables available (and not a total uncertainty, which can be thrown away before the binning). From your explanation above, I understand that uncertainty_systematic will be fully correlated propagation for bin. While uncertainty_random will reduce (since it has the term uncertainty in it and it is not uncertainty_systematic).
Sure. But we cannot introduce an option that only works for your case. If we introduce an option in HARP it needs to have defined and proper behaviour for everything that a user can potentially throw at HARP as input.
Throwing the total uncertainty away is not proper behaviour.
The correlation between which quantities (indicated above in bold) are you referring to? And what do you mean exactly by values?
I agree that there will be correlation between true column values of neighbouring pixels, and also between measured column values. I also agree that there will be correlation between the uncertainties of neighbouring pixels (the uncertainty of pixel 1 is likely to be comparable to the uncertainty of pixel 2), even if the error of both pixels is totally uncorrelated.
But for uncertainty propagation, I would think only error correlation is relevant, not column value correlation or uncertainty correlation.
When doing spatial binning of a very noisy product (which per-pixel easily goes below zero), we see that the below-zero fluctuations get reduced. Which would not be in agreement with a non-reducing uncertainty.
In the wikipedia article I guess the following is the important sentence, right?
If the uncertainties are correlated then covariance must be taken into account. Correlation can arise from two different sources. First, the measurement errors may be correlated. Second, when the underlying values are correlated across a population, the uncertainties in the group averages will be correlated.
The point is that the propagate_uncertainty option only applies to temporal binning (bin) and not to spatial binning (bin_spatial). And that the option is mainly intended for the total combined uncertainty (not for the random/systematic parts; properly propagating that is a mess, so whatever we do now will be suboptimal anyway).
Propagating the total uncertainty in an uncorrelated way in the temporal domain then has its uses, but for the spatial domain not.
This would be something to discuss as part of the open ticket. Until there is a clear path forward (if there ever will be) I am very hesitant to introduce additional hacks to try to separate the random/systematic parts in the uncertainty propagation.
Alternatively, would there be a way to save the number of valid (entire) pixels covering the grid cell, or, equivalently, the fractional coverage of the grid cell?
We would like to save this information not only for the uncertainties, but also as information to the user of the grid: “how many satellite observations” are included in the mean? A count per grid cell?
This fractional coverage is already available through the weight variables.
If you really want a ‘count’ and you are using a very large grid cell area, you could switch to spatial binning based on mid points (i.e. remove the latitude/longitude boundary variables, and only keep the satellite pixel mid points).