Robust Aggregation 101

This document analyzes how to aggregate multiple data reports into a single reliable estimate. We review aggregation methods and their properties, focusing on resistance to errors and manipulation.

When protocols encounter tasks that cannot be solved on-chain (oracle tasks) and require a decentralized solution, it's crucial to distinguish between two fundamentally different purposes of decentralization:

  1. Decentralization for Better Results - When there's no single "right" answer, decentralizing the methodology itself improves outcomes. This applies to scenarios where different approaches provide complementary insights or where aggregating multiple viewpoints improves accuracy.

  2. Decentralization for Security - When the task is well-defined but requires protection against manipulation or failure. This applies to operations requiring high reliability and fault tolerance where a single point of failure must be avoided.

The distinction between these purposes is fundamental: the first seeks to improve quality through diverse methodologies, while the second ensures integrity through distributed execution of a single, well-defined process.

In what follows, we review aggregation methods and their properties, focusing on resistance to errors and manipulation, in order to find a strategy that balances accuracy and efficiency. We place special emphasis on the weighted median, since it is the most common aggregation method for price feed oracles.

Problem Definition

We first consider a simple setting where the data originates from a single source.

Let $p_t$ be the relevant quantity at time $t$, e.g., the BTC/USD price. Notice that $p_t$ is unknown. Instead, we can access an estimate $r_t$ from a known and agreed-upon source, for instance, Interactive Brokers. A set of $N$ fetchers fetch $r_t^1, r_t^2, ..., r_t^N$ for us, where $r_t^i$ denotes the quantity reported by fetcher $i$. The aggregate $S_t$ is the number we forward as our estimate of $r_t$ (and essentially $p_t$).

The question is how to aggregate $r_t^1, r_t^2, ..., r_t^N$ into one number, $S_t$, representing the price according to that source at time $t$.

Considerations

  1. Time-variance. Since time is continuous, fetchers access the source at slightly different times. We don’t expect the time differences to be significant; in particular, they should not exceed one second, which is the I/O rate of our blockchain.

  2. Simplicity. This is crucial both for runtime considerations and for explaining to users how our mechanism works (KISS/Occam’s razor).

  3. (Honest) mistakes. Although we have incentive mechanisms to reward/punish fetchers’ behavior, mistakes such as downtime or latency are unavoidable. These can happen even when fetchers are honest and thus should be accounted for.

  4. Malicious behavior. Our solution should be as robust as possible to attacks. Namely, it should minimize the consequences of an attack and facilitate punishing attackers.

To quantify the last point, the Breakdown Point of an aggregate is the minimal fraction of the reports that a malicious actor must control in order to move the aggregate arbitrarily far from $r_t$.
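
As a quick illustration (with hypothetical report values), a single corrupted report can drag the mean arbitrarily far, while the median stays inside the range of the honest reports:

```python
import statistics

# Hypothetical BTC/USD reports from four honest fetchers plus one corrupted report.
honest_reports = [41995.0, 42000.0, 42005.0, 42010.0]
reports = honest_reports + [1e12]  # a single malicious or buggy report

print(statistics.mean(reports))    # ~2e11: dragged arbitrarily far by one report (breakdown point 0)
print(statistics.median(reports))  # 42005.0: still inside the honest range (breakdown point 50%)
```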

Possible solutions

We review below several options for the aggregation. All of them are special cases of minimizing an objective function. While there are infinitely many such functions, our analysis focuses mainly on the median, the average, and their trimmed counterparts (a small computational sketch follows the list below).

  1. Simple Average (Mean)

    • Pros: Easy to calculate; treats all reports equally; good for consistent data without outliers.

    • Cons: Can be skewed arbitrarily by a single outlier; its breakdown point is zero.

  2. Weighted Average

    • Pros: Accounts for the varying significance of each report (e.g., based on stake); more accurate if reports are not equally reliable.

    • Cons: More complex to calculate, can still be skewed.

  3. Median

    • Pros: Less affected by outliers than the mean; simple to understand and calculate.

    • Cons: May not reflect the influence of very large/small prices, since it treats them as outliers.

  4. Mode (most common value)

    • Pros: Represents the most frequently occurring price; useful in markets with standard pricing.

    • Cons: Ill-defined or unrepresentative when prices vary widely or no price repeats.

  5. Trimmed Mean

    • Pros: Excludes outliers by trimming a specified percentage of the highest and lowest values before averaging; balances the influence of outliers.

    • Cons: Arbitrariness in deciding what percentage to trim; could exclude relevant data.

  6. Quantile-based Aggregation

    • Pros: Can focus on a specific part of the distribution (e.g., median is the 50% quantile); useful for risk management strategies.

    • Cons: Not representative of the entire data set; can be as sensitive to outliers as the mean.
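
The sketch below (hypothetical values; the trim fraction is an arbitrary choice) compares the mean, the median, and a trimmed mean on the same set of reports:

```python
import statistics

def trimmed_mean(values, trim_fraction):
    """Average after dropping the lowest and highest trim_fraction of the values."""
    values = sorted(values)
    k = int(len(values) * trim_fraction)
    kept = values[k:len(values) - k] if k > 0 else values
    return sum(kept) / len(kept)

# Hypothetical reports with one high outlier.
reports = [41995.0, 42000.0, 42005.0, 42010.0, 48000.0]

print(statistics.mean(reports))    # 43202.0: pulled upward by the outlier
print(statistics.median(reports))  # 42005.0
print(trimmed_mean(reports, 0.2))  # 42005.0: the highest and lowest reports are trimmed before averaging
```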

Weighted median

The weighted median enjoys the robustness of the median together with the weighted average's ability to account for different significance levels. Its breakdown point is 50% of the weight; below that, an adversary can only move the result within the range of correctly reported values (as we show below). The weighted median thus allows us to incorporate the stake of the different fetchers.

Mathematically, let $w^i$ be the (positive) stake of fetcher $i$, and assume that $\sum_{i=1}^N w^i = 1$. Also, for simplicity, assume that $r_t^1, r_t^2, ..., r_t^N$ are sorted. The weighted median is an element $r_t^k$ such that

$$\sum_{i=1}^{k-1} w^i \leq \frac{1}{2} \quad \text{and} \quad \sum_{i=k+1}^{N} w^i \leq \frac{1}{2}$$

Therefore, our aggregate is $A_t = r_t^k$ for such $k$.
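
A minimal sketch of this definition (report values and stakes are hypothetical; the helper normalizes by the total stake rather than assuming the weights sum to 1):

```python
def weighted_median(reports, stakes):
    """Return a report r^k such that the stake strictly below it and strictly above it
    are each at most half of the total stake."""
    total = sum(stakes)
    pairs = sorted(zip(reports, stakes))  # sort reports, carrying their stakes along
    cumulative = 0.0
    for value, stake in pairs:
        cumulative += stake
        if cumulative >= total / 2:
            return value
    return pairs[-1][0]

# Hypothetical BTC/USD reports and (already normalized) fetcher stakes.
reports = [42010.0, 41995.0, 42005.0, 42000.0]
stakes = [0.4, 0.1, 0.3, 0.2]
print(weighted_median(reports, stakes))  # 42005.0
```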

To demonstrate the robustness of the weighted median, we present the following theorem. It proves that as long as the majority of the stake belongs to honest fetchers, the aggregate will always be between the honest reports; namely, an attacker with a minority weight (stake) cannot shift the aggregate too much.

Theorem: Let $H \subset [N]$ be the set of honest fetchers, with $\sum_{i\in H} w^i > \frac{1}{2}$, and let $M \subset [N]$ be the set of malicious fetchers, with $\sum_{i\in M} w^i < \frac{1}{2}$ and $H \cup M = [N]$.

Then, the weighted median aggregate $A_t$ always satisfies

$$A_t \in [\min_{i\in H} r^i_t, \ \max_{i\in H} r^i_t].$$

Proof sketch: if the aggregate were strictly below every honest report, then all reports up to and including it would be malicious, so the weight above it would exceed $\frac{1}{2}$, contradicting the definition of the weighted median; the argument for the upper end is symmetric.

Recall that in the reasonable case we do not expect high variance among (honest) reports; thus, the interval $[\min_{i\in H} r^i_t, \max_{i\in H} r^i_t]$ will be small. This ensures that our aggregate is robust to manipulation.
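
Using the weighted_median sketch above, the following hypothetical scenario illustrates the theorem: a malicious fetcher holding 40% of the stake cannot push the aggregate outside the range of the honest reports.

```python
# Honest fetchers hold 60% of the stake; a single malicious fetcher holds 40%
# and reports an absurd value, trying to push the aggregate arbitrarily high.
honest_reports = [42000.0, 42005.0, 42010.0]
honest_stakes = [0.25, 0.20, 0.15]

aggregate = weighted_median(honest_reports + [1e12], honest_stakes + [0.40])
print(aggregate)  # 42010.0: still within [min, max] of the honest reports
```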

Averaging Over Time

Recall that prices are susceptible to noise and volatility. Therefore, financial applications often average prices over time. Well-known methods include Moving Average, Exponential Smoothing, Time-weighted Average Price (TWAP), and Volume-weighted Average Price (VWAP).

Our current service does not implement such time averages; we leave customers the flexibility to compute them on their end.
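
For illustration, a customer consuming the feed could, for example, smooth successive aggregates with a simple exponential moving average on their side; a minimal sketch (the smoothing factor is an arbitrary choice):

```python
def exponential_smoothing(prices, alpha=0.2):
    """Smooth a sequence of aggregates; a higher alpha reacts faster to new prices."""
    smoothed = prices[0]
    out = [smoothed]
    for price in prices[1:]:
        smoothed = alpha * price + (1 - alpha) * smoothed
        out.append(smoothed)
    return out

# Hypothetical per-update aggregates A_t received from the feed.
aggregates = [42000.0, 42050.0, 41990.0, 42020.0]
print(exponential_smoothing(aggregates))
```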

Multi-Source Aggregation

We now face a similar problem, except that each quantity is given by a source (and not a fetcher). We have a set of $N$ sources $1, ..., N$, where each source has a price $S_t^i$. Alongside the different prices $S_t^1, S_t^2, ..., S_t^N$, we have an additional weight $w^i$ per source. Weights capture our confidence in that source, its volume, liquidity, etc. The weight-adjusted price is given by

$$A_t = \frac{\sum_{i=1}^N w^i S^i_t}{\sum_{i=1}^N w^i}$$

There are several ways to set the weights.

  1. Volume-Weighted Average Price (VWAP):

    • Description: VWAP is calculated by taking the total dollar value traded across all trading periods and dividing it by the total trading volume for the current day. In our case, it involves weighting each source's rate by its volume, giving more influence to sources with higher trading volumes (see the sketch after this list).

    • Advantages: Reflects liquidity and is a common benchmark used by traders. It gives a fair reflection of market conditions over the given period.

    • Disadvantages: More susceptible to volume spikes, which can distort the average price.

  2. Liquidity-Adjusted Weighting:

    • Description: Here, the rate from each source is weighted based on its liquidity. This method requires a clear definition and measurement of liquidity, which can include factors like bid-ask spread, market depth, and the speed of price recovery after a trade.

    • Advantages: Provides a more realistic view of the market by acknowledging that more liquid markets better reflect the true market price.

    • Disadvantages: Liquidity can be harder to measure accurately and may vary quickly, making it challenging to maintain an accurate aggregate price in real-time.
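
As a sketch of the first option, the weights can simply be each source's traded volume over the period (all figures below are hypothetical); the weight-adjusted price then follows the formula above:

```python
def weight_adjusted_price(prices, weights):
    """A_t = sum(w^i * S^i_t) / sum(w^i)."""
    return sum(w * s for w, s in zip(weights, prices)) / sum(weights)

# Hypothetical per-source prices S^i_t, weighted by each source's traded volume (VWAP-style).
source_prices = [42010.0, 41990.0, 42030.0]
source_volumes = [1200.0, 800.0, 150.0]

print(weight_adjusted_price(source_prices, source_volumes))  # ~42004: dominated by the high-volume sources
```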

Summary

This document explores single-source price aggregation in oracle systems, covering both result improvement and security aspects of decentralization. It analyzes various aggregation methods, focusing on weighted median for its manipulation resistance and stake-based weighting capabilities, while also examining time-based averaging and multi-source aggregation approaches.

The document highlights breakdown points in aggregation methods, demonstrates weighted median's security when honest fetchers hold majority stake, and evaluates trade-offs between different aggregation strategies.
