while the problems with our hosting provider are not yet resolved, the cluster should be operating nominally. that should include the proper and relatively reliable functioning of the relay/onRelay function.
note that relay/onRelay is rate limited to about 1 relay every 2 seconds (approximated with a sliding window of about 10 relays every 20 seconds). this rate limit is in place because the relay/onRelay function is expensive to provide and a developer was abusing it. the limit is implemented by dropping relay requests for a time once the rate exceeds the limit. the limit applies per sending NetConnection.
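to make the sliding-window behavior concrete, here is a minimal sketch in Python. it is not the actual server code: the class name, the method name, and the exact way old requests age out of the window are assumptions for illustration. it only shows the general idea of allowing about 10 relays per 20 seconds for each sending connection and dropping anything over that.

```python
from collections import deque


class SlidingWindowLimiter:
    """Hypothetical limiter: at most `max_events` per `window_seconds` per key."""

    def __init__(self, max_events=10, window_seconds=20.0):
        self.max_events = max_events
        self.window_seconds = window_seconds
        # Maps a key (here, a sending connection) to the timestamps
        # of the requests that were accepted for it.
        self._events = {}

    def allow(self, key, now):
        """Return True if a request for `key` at time `now` is within the limit."""
        timestamps = self._events.setdefault(key, deque())
        # Discard accepted requests that have slid out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_events:
            return False  # over the limit: this request would be dropped
        timestamps.append(now)
        return True
```

a caller would pass something like `time.monotonic()` as `now` and use one key per connection. with the defaults, a burst of 12 requests from one connection within a second gets 10 accepted and 2 dropped, and that connection is accepted again only once its oldest accepted request is 20 seconds old. other connections are unaffected.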
my logs indicate that over the last few minutes, 94% of all relay requests were delivered. the two likely causes for non-delivery are:
1) non-existent target peer ID
2) exceeding the relay rate limit
given the high rate of success, it is not likely that there is a system-related cause (such as a DHT partition).
Thanks for the reply. We have debugged it, and it turned out to be a mistake we made ourselves while refactoring the code. This is precisely why we asked: we wanted to be certain the problem was on our side.
On the subject of relay rate limits, though (for future reference): are the limits you describe imposed on a single NetConnection or on a single IP address?