Truncated file error with HARP for 2022 RPRO L2__NO2 Files for 2018-2019

I keep getting a truncated file error when trying to process certain files from 2018-2019 using the code below. week_relative_paths is a list of paths into the S3 bucket for all the files in a given week. The code works if I remove the files that are causing issues.

mean_no2 = harp.import_product(week_relative_paths, harp_operations, reduce_operations=reduce_operations)
name = f's5p-NO2_L3_weekly_averaged_{start_date_withouttime}.nc'
harp.export_product(mean_no2, filename=os.path.join(output_sentinel_dir_path, name), file_format="netcdf")

I read that this may be caused by files corrupted during download, so I have now mounted the AWS open-data Sentinel-5P bucket on my AWS instance so that the code pulls directly from S3, and I am still getting the exact same errors for the exact same files I got errors for when downloading. Does this mean that some of the 2022 reprocessed NO2 Sentinel files on S3 are corrupted (Sentinel-5P Level 2 - Registry of Open Data on AWS), and if so, is there a more reliable place to pull from? Or is it some other issue that can be solved with harp?

Also, is there any way to add code that checks for truncated files before processing, so the code will run without error? I am trying to process 100+ weeks of data, so this would be ideal. Right now I've added some code to filter out files smaller than 300 MB, which seems to catch the corrupted files for most weeks, but I would love a way to detect corrupted files more explicitly so the code can run without error (and so I don't filter out files that are actually fine).
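For reference, my current size-based filter looks roughly like this (the 300 MB threshold is just a guess that happens to work for these products, and the paths are assumed to be local or mounted files):

```python
import os

# ~300 MB: an ad-hoc threshold below which L2__NO2 files are assumed truncated
MIN_SIZE_BYTES = 300 * 1024 * 1024

def filter_by_size(paths, min_size=MIN_SIZE_BYTES):
    """Keep only files at least `min_size` bytes; smaller ones are
    assumed to be truncated/corrupted downloads."""
    return [p for p in paths if os.path.getsize(p) >= min_size]
```

This catches most truncated files, but a file can in principle be corrupted without being unusually small, which is why I'd prefer an explicit check.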

If you get truncation errors, then this is indeed likely due to corrupted data files.

Be aware that the official place for getting L2 NO2 files is the Copernicus Data Space Ecosystem (CDSE). They have various APIs to get the data and also provide S3 access to the data (but this can only be efficiently accessed large-scale if you are within one of the cloud environments that are part of CDSE, which currently are CloudFerro and Open Telekom Cloud).

The dataset on AWS is, in any case, not an official one. For the products that give you problems, you could check whether the file size and MD5 checksum match the corresponding entries in CDSE. If they don't, you could raise this via the contact details mentioned on the AWS S5P page that you linked.
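Computing the MD5 checksum of a large product file for such a comparison can be done in chunks so the whole file is never held in memory; a minimal sketch (the chunk size is arbitrary):

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MB chunks
    (S5P L2 products can be several hundred MB)."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting hex string can then be compared against the checksum that CDSE reports for the same product.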

Thank you! This is helpful. And there is no more explicit way to check for, and exclude, these corrupted files while processing with harp? I'm processing by week, so ideally it would just skip any truncated files.

Thank you!

You could try running codacheck or harpcheck (both command-line tools) on the products before using them in your Python script(s). These should return a non-zero exit code if a product has a problem.
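A minimal sketch of wiring this into the Python workflow, assuming harpcheck (or codacheck) is on the PATH and the paths point to local/mounted files; the filtering step is illustrative glue code, not part of harp's Python API:

```python
import subprocess

def is_valid_product(path, checker="harpcheck"):
    """Run a command-line checker (harpcheck or codacheck) on a product
    file; a zero exit code means the file passed the check."""
    result = subprocess.run(
        [checker, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

# Keep only products that pass the check, then import the valid ones:
# valid_paths = [p for p in week_relative_paths if is_valid_product(p)]
# mean_no2 = harp.import_product(valid_paths, harp_operations,
#                                reduce_operations=reduce_operations)
```

Running the checker per file before the weekly import means one truncated product no longer aborts the whole week's processing.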