Article

A Shannon-Theoretic Approach to the Storage–Retrieval Trade-Off in PIR Systems

Chao Tian 1,*, Hua Sun 2 and Jun Chen 3

1 Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77845, USA
2 Department of Electrical Engineering, University of North Texas, Denton, TX 76203, USA
3 Department of Electrical and Computer Engineering, McMaster University, Hamilton, ON L8S 4K1, Canada
* Author to whom correspondence should be addressed.
Submission received: 8 November 2022 / Revised: 22 December 2022 / Accepted: 5 January 2023 / Published: 11 January 2023
(This article belongs to the Special Issue Advanced Technologies in Storage, Computing, and Communication)

Abstract

We consider the storage–retrieval rate trade-off in private information retrieval (PIR) systems using a Shannon-theoretic approach. Our focus is mostly on the canonical two-message two-database case, for which a coding scheme based on random codebook generation and the binning technique is proposed. This coding scheme reveals a hidden connection between PIR and the classic multiple description source coding problem. We first show that when the retrieval rate is kept optimal, the proposed non-linear scheme can achieve better performance than any linear scheme. Moreover, a non-trivial storage–retrieval rate trade-off can be achieved beyond space-sharing between this extreme point and the other optimal extreme point, achieved by the retrieve-everything strategy. We further show that with a method akin to the expurgation technique, one can extract a zero-error PIR code from the random code. Outer bounds are also studied and compared to establish the superiority of the non-linear codes over linear codes.

1. Introduction

Private information retrieval (PIR) addresses the situation of storing K messages of L bits each in N databases, with the requirement that the identity of any requested message must be kept private from any one (or any small subset) of the databases. The early works largely took a theoretical computer science perspective [1], where L = 1, and the main question was the scaling law of the retrieval rate in terms of (K, N).
The storage overhead in PIR systems has been studied in the coding and information theory community from several perspectives using mainly two problem formulations. Shah et al. [2] considered the problem when N is allowed to vary with L and K, and obtained some conclusive results. In a similar vein, for L = 1 , Fazeli et al. [3] proposed a technique to convert any linear PIR code to a new one with low storage overhead by increasing N. Other notable results along this line can be found in [4,5,6,7,8,9].
An information theoretic formulation of the PIR problem was considered in [10], where L is allowed to increase while (N, K) are kept fixed. Important properties of the trade-off between the storage rate and the retrieval rate were identified in [10], and a linear code construction was proposed. In this formulation, even without any storage overhead constraint, characterizing the minimum retrieval rate in PIR systems is nontrivial, and this capacity problem was settled in [11]. Tajeddine et al. [12] considered the capacity problem when the message is coded across the databases with a maximum-distance separable (MDS) code, which was later solved by Banawan and Ulukus [13]. Capacity-achieving code designs with optimal message sizes were given in [14,15]. Systems where servers can collude were considered in [16]. There have been various extensions and generalizations, and the recent survey article [17] provides a comprehensive overview of efforts following this information theoretic formulation.
In many existing works, the storage component and the PIR component are largely designed separately, usually by placing certain structural constraints on one of them, e.g., the MDS coding requirement for the storage component [13], or the storage is uncoded [18]; moreover, the code constructions are almost all linear. The few exceptions we are aware of are [19,20,21]. In this work, we consider the information theoretic formulation of the PIR problem, without placing any additional structural constraints on the two components, and explicitly investigate the storage–retrieval trade-off region. We mostly focus on the case N = K = 2 here since it provides the most important intuition; we refer to this as the ( 2 , 2 ) PIR system. Our approach naturally allows the joint design of the two components using either linear or non-linear schemes.
The work in [19] is of significant relevance to our work: it considered the storage overhead in both single-round and multi-round PIR systems when the retrieval rate must be optimal. Although multi-round PIR has the same capacity as single-round PIR, it was shown that at the minimum retrieval rate, a multi-round, ϵ-error, non-linear code can indeed break the storage performance barrier of an optimal single-round, zero-error, linear code. The question of whether all three of these differences are essential to overcome this barrier was left open.
In this work, we show that a non-linear code is able to achieve better performance than the optimal linear code in the single-round zero-error ( 2 , 2 ) PIR system, over a range of the storage rates. This is accomplished by providing a Shannon-theoretic coding scheme based on random codebook generation and the binning technique. The proposed scheme at the minimum retrieval rate is conceptually simpler, and we present it as an explicit example. The general inner bound is then provided, and we show an improved trade-off can be achieved beyond space-sharing between the minimum retrieval rate code and the other optimal extreme point. By leveraging a method akin to the expurgation technique, we further show that one can extract a zero-error deterministic PIR code from the random ϵ -error PIR code. Outer bounds are also studied for both general codes and linear codes, which allow us to establish conclusively the superiority of non-linear codes over linear codes. Our work essentially answers the open question in [19], and shows that, in fact, only non-linearity is essential in breaking the aforementioned barrier.
A preliminary version of this work was presented first in part in [22]. In this updated article, we provide a more general random coding scheme, which reveals a hidden connection to the multiple description source coding problem [23]. Intuitively, we can view the retrieved message as certain partial reconstruction of the full set of messages, instead of a complete reconstruction of a single message. Therefore, the answers from the servers can be viewed as descriptions of the full set of messages, which are either stored directly at the servers or formed at the time of request, and the techniques seen in multiple description coding become natural in the PIR setting. Since the publication of the preliminary version [22], several subsequent efforts have been made in studying the storage–retrieval trade-off in the PIR setting, which provided stronger and more general information theoretic outer bounds and several new linear code constructions [20,21,24]. However, the Shannon-theoretic random coding scheme given in [22] remains the best-performing for the ( 2 , 2 ) case, which motivates us to provide the general coding scheme in this work and to make the connection to multiple description source coding more explicit. It is our hope that this connection may bring existing coding techniques for the multiple description problem to the study of the PIR problem.

2. Preliminaries

The problem we consider is essentially the same as that in [11], with the additional consideration of a storage overhead constraint at the databases. We provide a formal problem definition in the more traditional Shannon-theoretic language to facilitate the subsequent treatment. Some relevant results on this problem are also reviewed briefly in this section.

2.1. Problem Definition

There are two independent messages, denoted as $W_1$ and $W_2$, in this system, each of which is generated uniformly at random in the finite field $\mathbb{F}_{2^L}$, i.e., each message is an L-bit sequence. There are two databases storing the messages, whose contents are produced by two encoding functions operating on $(W_1, W_2)$:
$\phi_n : \mathbb{F}_{2^L} \times \mathbb{F}_{2^L} \to \mathbb{F}_2^{\alpha_n}, \quad n = 1, 2,$ (1)
where $\alpha_n$ is the number of storage symbols at database-n, n = 1, 2, which is a deterministic function of L, i.e., we are using fixed-length codes for storage. We write $S_1 = \phi_1(W_1, W_2)$ and $S_2 = \phi_2(W_1, W_2)$. When a user requests message-k, it generates two queries $(Q_1^{[k]}, Q_2^{[k]})$ to be sent to the two databases, drawn randomly from the alphabet $\mathcal{Q} \times \mathcal{Q}$. Note that the joint distribution satisfies the condition
$P_{W_1, W_2, Q_1^{[k]}, Q_2^{[k]}} = P_{W_1, W_2} \, P_{Q_1^{[k]}, Q_2^{[k]}}, \quad k = 1, 2,$ (2)
i.e., the messages and the queries are independent. The marginal distributions $P_{W_1, W_2}$ and $P_{Q_1^{[k]}, Q_2^{[k]}}$, k = 1, 2, thus fully specify the randomness in the system.
After receiving the queries, the databases produce the answers to the query via a set of deterministic functions:
$\varphi_n^{(q)} : \mathbb{F}_2^{\alpha_n} \to \mathbb{F}_2^{\beta_n(q)}, \quad q \in \mathcal{Q}, \ n = 1, 2.$ (3)
We also write the answers as $A_n^{[k]} = \varphi_n^{(Q_n^{[k]})}(S_n)$, n = 1, 2. The user, with the retrieved information, wishes to reproduce the desired message through a set of decoding functions
$\psi^{(k, q_1, q_2)} : \mathbb{F}_2^{\beta_1(q_1)} \times \mathbb{F}_2^{\beta_2(q_2)} \to \mathbb{F}_{2^L}.$ (4)
The outputs of these functions, $\hat{W}_k = \psi^{(k, Q_1^{[k]}, Q_2^{[k]})}(A_1^{[k]}, A_2^{[k]})$, are essentially the retrieved messages. We require the system to retrieve the message correctly (zero error), i.e., $\hat{W}_k = W_k$ for k = 1, 2.
Alternatively, we can require the system to have a small error probability. Denote the average probability of coding error of a PIR code as
$P_e = 0.5 \sum_{k=1,2} P_{W_1, W_2, Q_1^{[k]}, Q_2^{[k]}}\big(W_k \neq \hat{W}_k\big).$ (5)
An $(L, \alpha_1, \alpha_2, \beta_1, \beta_2)$ ϵ-error PIR code is defined similarly to a (zero-error) PIR code, except that the correctness condition is replaced by the requirement that the probability of error satisfies $P_e \leq \epsilon$.
Finally, the privacy constraint stipulates that the identical distribution condition must be satisfied:
$P_{Q_n^{[1]}, A_n^{[1]}, S_n} = P_{Q_n^{[2]}, A_n^{[2]}, S_n}, \quad n = 1, 2.$ (6)
Note that one obvious consequence is that $P_{Q_n^{[1]}} = P_{Q_n^{[2]}} \triangleq P_{Q_n}$, for n = 1, 2.
We refer to the code, which is specified by the two probability distributions $P_{Q_1^{[k]}, Q_2^{[k]}}$, k = 1, 2, and a valid set of coding functions $\{\phi_n, \varphi_n^{(q)}, \psi^{(k, q_1, q_2)}\}$ that satisfy both the correctness and privacy constraints, as an $(L, \alpha_1, \alpha_2, \beta_1, \beta_2)$ PIR code, where $\beta_n = E_{Q_n}[\beta_n(Q_n)]$ for n = 1, 2.
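To make the definitions above concrete, here is a minimal Python sketch (ours, purely illustrative, and not the scheme studied in this paper) of the textbook replication-based (2, 2) PIR code: both databases store $(W_1, W_2)$ in full, the user sends a uniformly random selection vector to database-1 and the same vector with the k-th coordinate flipped to database-2, and XORs the two answers. Each query is marginally uniform, so the privacy condition (6) holds, and the code operates at $(\bar{\alpha}, \bar{\beta}) = (2, 1)$.

```python
import random

L = 8  # message length in bits (illustrative)

def phi(w1, w2):
    # Replication storage: each database stores (W1, W2) in full,
    # so alpha_n = 2L and the normalized storage rate is alpha_bar = 2.
    return (w1, w2), (w1, w2)

def make_queries(k):
    # A uniformly random selection vector a goes to database-1;
    # database-2 receives a with its k-th coordinate flipped.
    a = (random.randint(0, 1), random.randint(0, 1))
    return a, (a[0] ^ (k == 1), a[1] ^ (k == 2))

def answer(s, q):
    # Deterministic answer function: XOR of the selected messages
    # (beta_n = L bits per answer, so beta_bar = 1).
    w1, w2 = s
    out = 0
    if q[0]: out ^= w1
    if q[1]: out ^= w2
    return out

def retrieve(k, w1, w2):
    (s1, s2), (q1, q2) = phi(w1, w2), make_queries(k)
    return answer(s1, q1) ^ answer(s2, q2)  # selections differ only in W_k

w1, w2 = random.getrandbits(L), random.getrandbits(L)
assert retrieve(1, w1, w2) == w1 and retrieve(2, w1, w2) == w2
```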
Definition 1.
A normalized storage–retrieval rate pair ( α ¯ , β ¯ ) is achievable, if for any ϵ > 0 and sufficiently large L, there exists an ( L , α 1 , α 2 , β 1 , β 2 ) PIR code, such that
$L(\bar{\alpha} + \epsilon) \geq \frac{1}{2}(\alpha_1 + \alpha_2), \quad L(\bar{\beta} + \epsilon) \geq \frac{1}{2}(\beta_1 + \beta_2).$ (7)
The collection of achievable normalized storage–retrieval rate pairs $(\bar{\alpha}, \bar{\beta})$ is the achievable storage–retrieval rate region, denoted as $\mathcal{R}$.
Unless explicitly stated, the rate region $\mathcal{R}$ is used for the zero-error PIR setting. In the definition above, we used the average rates $(\bar{\alpha}, \bar{\beta})$ across the databases instead of the individual rate vector $\frac{1}{L}\big(\alpha_1, \alpha_2, E_{Q_1}[\beta_1(Q_1)], E_{Q_2}[\beta_2(Q_2)]\big)$. This can be justified using the following lemma.
Lemma 1.
If an ( L , α 1 , α 2 , β 1 , β 2 ) PIR code exists, then a ( 2 L , α , α , β , β ) PIR code exists, where
$\alpha = \alpha_1 + \alpha_2, \quad \beta = \beta_1 + \beta_2.$ (8)
This lemma can essentially be proved by a space-sharing argument, the details of which can be found in [19]. The following lemma is also immediate using a conventional space-sharing argument.
Lemma 2.
The region R is convex.

2.2. Some Relevant Known Results

The capacity of a general PIR system with K messages and N databases is identified in [11] as
$C = \frac{1 - 1/N}{1 - 1/N^K},$ (9)
which in our definition corresponds to the case when $\bar{\beta}$ is minimized, and the proposed linear code achieves $(\bar{\alpha}, \bar{\beta}) = \big(K, (1 - 1/N^K)/(N - 1)\big)$. The capacity of MDS-coded PIR systems was established in [13]. In the context of the storage–retrieval trade-off, this result can be viewed as providing the achievable trade-off pairs
$(\bar{\alpha}, \bar{\beta}) = \left(\frac{K}{t},\ \frac{1 - t^K/N^K}{N - t}\right), \quad t = 1, 2, \ldots, N,$
where the second coordinate at t = N is understood as the corresponding limit K/N. However, when specialized to the (2, 2) PIR problem, this does not provide any improvement over the space-sharing strategy between the trivial retrieve-everything code and the code in [11]. By specializing the code in [11], it was shown in [19] that for the (2, 2) PIR problem, at the minimal retrieval rate $\bar{\beta} = 0.75$, the storage rate $\bar{\alpha}_l = 1.5$ is achievable using a single-round, zero-error linear code, and that this is in fact the optimal storage rate that any single-round, zero-error linear code can achieve.
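As a quick numerical check (ours), the following sketch evaluates these MDS trade-off points exactly with rational arithmetic, treating the t = N case as the stated limit:

```python
from fractions import Fraction

def mds_point(N, K, t):
    # (alpha_bar, beta_bar) for (N, t) MDS-coded storage; the second
    # coordinate at t = N is taken as the limit K/N.
    alpha = Fraction(K, t)
    beta = Fraction(K, N) if t == N else Fraction(N**K - t**K, N**K * (N - t))
    return alpha, beta

# For (N, K) = (2, 2): t = 1 gives (2, 3/4), the capacity point of [11],
# and t = 2 gives (1, 1), the retrieve-everything point, i.e., exactly
# the two end points of the space-sharing line.
print([mds_point(2, 2, t) for t in (1, 2)])
```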
One of the key observations in [19] is that a special coding structure appears to be the main difficulty in the (2, 2) PIR setting, which is illustrated in Figure 1. Here, message $W_1$ can be recovered from either $(X_1, Y_1)$ or $(X_2, Y_2)$, and message $W_2$ can be recovered from either $(X_1, Y_2)$ or $(X_2, Y_1)$; $(X_1, X_2)$ is essentially $S_1$ and is stored at database-1, and $(Y_1, Y_2)$ is essentially $S_2$ and is stored at database-2. It is clear that we can use the following strategy to satisfy the privacy constraint: when message $W_1$ is requested, the user queries for either $(X_1, Y_1)$ or $(X_2, Y_2)$, each with probability 1/2; when message $W_2$ is requested, the user queries for either $(X_1, Y_2)$ or $(X_2, Y_1)$, each with probability 1/2. More precisely, the following probability distributions $P_{Q_1^{[1]}, Q_2^{[1]}}$ and $P_{Q_1^{[2]}, Q_2^{[2]}}$ can be used:
$P_{Q_1^{[1]}, Q_2^{[1]}}(q_1, q_2) = \begin{cases} 0.5, & (q_1, q_2) = (1, 1) \\ 0.5, & (q_1, q_2) = (2, 2) \end{cases}$ (10)
and
$P_{Q_1^{[2]}, Q_2^{[2]}}(q_1, q_2) = \begin{cases} 0.5, & (q_1, q_2) = (1, 2) \\ 0.5, & (q_1, q_2) = (2, 1). \end{cases}$ (11)

2.3. Multiple Description Source Coding

The multiple description source coding problem [23] considers compressing a memoryless source S into a total of M descriptions, i.e., M compressed bit sequences, such that any subset of these descriptions can be used to reconstruct the source S subject to certain quality requirements. The motivation of this problem is mainly to address the case when packets can be dropped randomly in a communication network.
Denote the coding rate for each description as $R_i$, $i = 1, 2, \ldots, M$. A coding scheme was proposed in [25], which leads to the following rate region. Let $U_1, U_2, \ldots, U_M$ be M random variables jointly distributed with S; then the following rates $(R_1, R_2, \ldots, R_M)$ and distortions $(D_A, A \subseteq \{1, 2, \ldots, M\})$ are achievable:
$\sum_{i \in A} R_i \geq \sum_{i \in A} H(U_i) - H\big(\{U_i, i \in A\} \mid S\big), \quad \forall A \subseteq \{1, 2, \ldots, M\},$ (12)
$D_A \geq E\big[d\big(S, f_A(U_i, i \in A)\big)\big], \quad \forall A \subseteq \{1, 2, \ldots, M\}.$ (13)
Here, $f_A$ is a reconstruction mapping from the random variables $\{U_i, i \in A\}$ to the reconstruction domain, $d(\cdot, \cdot)$ is a distortion metric, and $D_A$ is the distortion achievable using the descriptions in the set A. Roughly speaking, the coding scheme generates approximately $2^{nR_i}$ length-n codewords in an i.i.d. manner using the marginal distribution of $U_i$ for each $i = 1, 2, \ldots, M$, and the rate constraints ensure that when n is sufficiently large, with overwhelming probability there is a tuple of M codewords $(u_1^n, u_2^n, \ldots, u_M^n)$, one in each codebook, that is jointly typical with the source vector $S^n$. In this coding scheme, the descriptions are simply the indices of these codewords in their respective codebooks. For a given joint distribution $(S, U_1, U_2, \ldots, U_M)$, we refer to the rate region in (12) as the MD rate region $\mathcal{R}_{MD}(S, U_1, U_2, \ldots, U_M)$, and to the corresponding random code construction as the MD codebooks associated with $(S, U_1, U_2, \ldots, U_M)$.
The binning technique [26] can be applied in the multiple description problem to provide further performance improvements, particularly when not all combinations of the descriptions are required to satisfy performance constraints, but only a subset of them are; this technique was previously used in [27,28] for this purpose. Assume that only the subsets of descriptions $A_1, A_2, \ldots, A_T \subseteq \{1, 2, \ldots, M\}$ have distortion requirements associated with the reconstructions using these descriptions, denoted as $D_{A_i}$, $i = 1, 2, \ldots, T$. Consider the MD codebooks associated with $(S, U_1, U_2, \ldots, U_M)$ at rates $(R_1, R_2, \ldots, R_M) \in \mathcal{R}_{MD}(S, U_1, U_2, \ldots, U_M)$, and then assign the codewords in the i-th codebook uniformly at random into $2^{nR'_i}$ bins, where $0 \leq R'_i \leq R_i$. The coding rates and distortions that satisfy the following constraints simultaneously for all $A_i$, $i = 1, 2, \ldots, T$, are achievable:
$\sum_{j \in J} (R_j - R'_j) \leq \sum_{j \in J} H(U_j) - H\big(\{U_j, j \in J\} \mid \{U_j, j \in A_i \setminus J\}\big), \quad \forall J \subseteq A_i,$ (14)
$D_{A_i} \geq E\big[d\big(S, f_{A_i}(U_j, j \in A_i)\big)\big].$ (15)
We denote the collection of such rate vectors $(R'_1, R'_2, \ldots, R'_M, R_1, R_2, \ldots, R_M)$ as $\mathcal{R}^*_{MD}\big((S, U_1, U_2, \ldots, U_M), (\{U_j, j \in A_i\}, i = 1, 2, \ldots, T)\big)$, and refer to the corresponding codebooks as the MD* codebooks associated with the random variables $(S, U_1, U_2, \ldots, U_M)$ and the reconstruction sets $(A_1, A_2, \ldots, A_T)$.
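Both regions are straightforward to evaluate numerically for finite-alphabet distributions. The following Python sketch (our own helpers, with hypothetical names) computes the right-hand sides of (12) and (14) from a joint pmf of $(S, U_1, \ldots, U_M)$, and evaluates one sum-rate bound for the auxiliary variables that will be used in Section 3:

```python
from collections import defaultdict
from math import log2

def H(pmf):
    # Shannon entropy of a pmf given as {outcome: probability}.
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

def marginal(joint, idx):
    # Marginalize a joint pmf {tuple: prob} onto the coordinates in idx.
    out = defaultdict(float)
    for x, p in joint.items():
        out[tuple(x[i] for i in idx)] += p
    return out

def md_rate_bound(joint, s_idx, u_idx, A):
    # RHS of (12): sum_{i in A} H(U_i) - H({U_i, i in A} | S).
    cond = H(marginal(joint, [s_idx] + [u_idx[i] for i in A])) \
           - H(marginal(joint, [s_idx]))
    return sum(H(marginal(joint, [u_idx[i]])) for i in A) - cond

def bin_rate_bound(joint, u_idx, J, A):
    # RHS of (14): sum_{j in J} H(U_j) - H({U_j, j in J} | {U_j, j in A\J}).
    rest = [u_idx[j] for j in A if j not in J]
    cond = H(marginal(joint, [u_idx[j] for j in A])) - H(marginal(joint, rest))
    return sum(H(marginal(joint, [u_idx[j]])) for j in J) - cond

# Demo: S = (V1, V2) uniform over two bits, with the four auxiliary
# variables of Section 3 as the descriptions U_1, ..., U_4.
joint = defaultdict(float)
for v1 in (0, 1):
    for v2 in (0, 1):
        x1, x2 = v1 & v2, (1 - v1) & (1 - v2)
        y1, y2 = v1 & (1 - v2), (1 - v1) & v2
        joint[((v1, v2), x1, x2, y1, y2)] += 0.25
u = {1: 1, 2: 2, 3: 3, 4: 4}  # description index -> tuple coordinate
print(md_rate_bound(joint, 0, u, [1, 2]))  # sum-rate bound for {X1, X2}
```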

3. A Special Case: Slepian–Wolf Coding for Minimum Retrieval Rate

In this section, we consider the minimum-retrieval-rate case, and show that non-linear and Shannon-theoretic codes are beneficial. We will be rather cavalier here and ignore some details, in the hope of better conveying the intuition. In particular, we ignore the asymptotically vanishing probability of error that is usually associated with a random coding argument, but this will be addressed more carefully in Section 4.
Let us rewrite the L-bit messages as
$W_k = \big(V_k[1], \ldots, V_k[L]\big) \triangleq V_k^L, \quad k = 1, 2.$ (16)
The messages can be viewed as being produced from a discrete memoryless source $P_{V_1, V_2} = P_{V_1} \cdot P_{V_2}$, where $V_1$ and $V_2$ are independent, uniformly distributed Bernoulli random variables.
Consider the following auxiliary random variables:
$X_1 \triangleq V_1 \wedge V_2, \quad X_2 \triangleq (\neg V_1) \wedge (\neg V_2), \quad Y_1 \triangleq V_1 \wedge (\neg V_2), \quad Y_2 \triangleq (\neg V_1) \wedge V_2,$ (17)
where ¬ is the binary negation and ∧ is the binary "and" operation. This particular distribution satisfies the coding structure depicted in Figure 1, with $(V_1, V_2)$ taking the role of $(W_1, W_2)$, and the relation is non-linear. The same distribution was used in [19] to construct a multi-round PIR code. This non-linear mapping appears to allow the resultant code to be more efficient than linear codes.
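This structure can be confirmed by an exhaustive truth-table check. In the sketch below (ours), the recovery maps, an OR and a negated OR, are one concrete reading of the structure; the scheme only requires that some deterministic recovery functions exist:

```python
from itertools import product

# Check that X1 = V1 AND V2, X2 = (NOT V1) AND (NOT V2),
# Y1 = V1 AND (NOT V2), Y2 = (NOT V1) AND V2 realize Figure 1.
for v1, v2 in product((0, 1), repeat=2):
    x1, x2 = v1 & v2, (1 - v1) & (1 - v2)
    y1, y2 = v1 & (1 - v2), (1 - v1) & v2
    assert v1 == x1 | y1        # W1 from (X1, Y1)
    assert v1 == 1 - (x2 | y2)  # W1 from (X2, Y2)
    assert v2 == x1 | y2        # W2 from (X1, Y2)
    assert v2 == 1 - (x2 | y1)  # W2 from (X2, Y1)
```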
We wish to store $(X_1^L, X_2^L)$ at the first database in a lossless manner; however, we store only certain necessary information regarding $Y_1^L$ and $Y_2^L$ to facilitate the recovery of $W_1$ or $W_2$. For this purpose, we encode the messages as follows:
  • At database-1, compress and store $(X_1^L, X_2^L)$ losslessly;
  • At database-2, encode $Y_1^L$ using a Slepian–Wolf code (or, more precisely, Sgarro’s code with uncertain side information [29]), with either $X_1^L$ or $X_2^L$ at the decoder; the resulting code index is denoted as $C_{Y_1}$. Encode $Y_2^L$ in the same manner, independently of $Y_1^L$; its code index is denoted as $C_{Y_2}$.
It is clear that for database-1, we need roughly $\bar{\alpha}_1 = H(X_1, X_2)$. At database-2, in order to guarantee successful decoding of the Slepian–Wolf code, we can choose roughly
$\bar{\alpha}_2 = \max\big(H(Y_1|X_1), H(Y_1|X_2)\big) + \max\big(H(Y_2|X_1), H(Y_2|X_2)\big) = 2H(Y_1|X_1),$
where the second equality is due to the symmetry in the probability distribution. Thus, we find that this code achieves
$\bar{\alpha}_{nl} = 0.5\big[H(X_1, X_2) + 2H(Y_1|X_1)\big] = 0.75 + 0.75\,H(1/3, 2/3) = 0.25 + 0.75 \log_2 3 \approx 1.4387.$
The retrieval strategy is immediate from the coding structure in Figure 1, with $(V_1^L, V_2^L, X_1^L, X_2^L, C_{Y_1}, C_{Y_2})$ serving the roles of $(W_1, W_2, X_1, X_2, Y_1, Y_2)$, and thus the privacy constraint is indeed satisfied. The retrieval rates are roughly as follows:
$\bar{\beta}_1^{(1)} = \bar{\beta}_1^{(2)} = H(X_1) = H(X_2),$
$\bar{\beta}_2^{(1)} = \bar{\beta}_2^{(2)} = H(Y_1|X_1),$
implying
$\bar{\beta} = 0.5\big[H(X_1) + H(Y_1|X_1)\big] = 0.5\,H(X_1, Y_1) = 0.75.$
Thus, at the optimal retrieval rate β ¯ = 0.75 , we have
$\bar{\alpha}_l = 1.5 \quad \text{vs.} \quad \bar{\alpha}_{nl} \approx 1.4387,$
and clearly the proposed non-linear Shannon-theoretic code performs better than the optimal linear code. We note that it was shown in [19] that, by using a multi-round approach, the storage rate $\bar{\alpha}$ can be further reduced; however, this issue is beyond the scope of this work. In the rest of the paper, we build on the intuition gained in this special case to generalize and strengthen the coding scheme.
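The entropy arithmetic above is easy to verify directly; the following short computation (ours) reproduces both the storage rate and the retrieval rate of this scheme:

```python
from math import log2

def H(*p):
    # Entropy of a probability vector; zero entries contribute nothing.
    return -sum(x * log2(x) for x in p if x > 0)

# Under uniform (V1, V2): X1 = 1 w.p. 1/4; (X1, X2) takes the values
# (0,0), (1,0), (0,1) w.p. 1/2, 1/4, 1/4; given X1 = 0 (prob. 3/4),
# Y1 = 1 w.p. 1/3, and given X1 = 1, Y1 = 0 deterministically.
H_X1 = H(1/4, 3/4)
H_X1X2 = H(1/2, 1/4, 1/4)
H_Y1_given_X1 = 3/4 * H(1/3, 2/3)

alpha_nl = 0.5 * (H_X1X2 + 2 * H_Y1_given_X1)  # = 0.25 + 0.75*log2(3)
beta_bar = 0.5 * (H_X1 + H_Y1_given_X1)        # = 0.75 exactly
print(round(alpha_nl, 4), round(beta_bar, 4))  # 1.4387 0.75
```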

4. Main Result

4.1. A General Inner Bound

We first present a general inner bound to the storage–retrieval trade-off region. Let $(V_1, V_2)$ be independent random variables uniformly distributed on $\mathbb{F}_2^t \times \mathbb{F}_2^t$. Define the region $\mathcal{R}_{in}(t)$ to be the collection of $(\bar{\alpha}, \bar{\beta})$ pairs for which there exist random variables $(X_0, X_1, X_2, Y_1, Y_2)$ jointly distributed with $(V_1, V_2)$ such that the following hold:
  • There exist deterministic functions $f_{1,1}$, $f_{1,2}$, $f_{2,1}$, and $f_{2,2}$ such that
    $V_1 = f_{1,1}(X_0, X_1, Y_1) = f_{2,2}(X_0, X_2, Y_2), \quad V_2 = f_{1,2}(X_0, X_1, Y_2) = f_{2,1}(X_0, X_2, Y_1);$
  • There exist non-negative coding rates
    $\big(\beta_1^{(0)}, \beta_1^{(1)}, \beta_1^{(2)}, \beta_2^{(1)}, \beta_2^{(2)}, \gamma_1^{(0)}, \gamma_1^{(1)}, \gamma_1^{(2)}, \gamma_2^{(1)}, \gamma_2^{(2)}\big) \in \mathcal{R}^*_{MD}\big(((V_1, V_2), X_0, X_1, X_2, Y_1, Y_2), (\{X_0, X_1, Y_1\}, \{X_0, X_1, Y_2\}, \{X_0, X_2, Y_1\}, \{X_0, X_2, Y_2\})\big);$ (24)
  • There exist non-negative storage rates $(\alpha_1^{(0)}, \alpha_1^{(1)}, \alpha_1^{(2)}, \alpha_2^{(1)}, \alpha_2^{(2)})$ such that
    $\alpha_1^{(0)} \geq \beta_1^{(0)}, \quad \alpha_1^{(1)} \geq \beta_1^{(1)}, \quad \alpha_1^{(2)} \geq \beta_1^{(2)}, \quad \alpha_2^{(1)} \geq \beta_2^{(1)}, \quad \alpha_2^{(2)} \geq \beta_2^{(2)},$
    and if
    $\big(\gamma_1^{(0)} - \beta_1^{(0)}\big) + \big(\gamma_1^{(1)} - \beta_1^{(1)}\big) + \big(\gamma_1^{(2)} - \beta_1^{(2)}\big) < H(X_0) + H(X_1) + H(X_2) - H(X_0, X_1, X_2),$
    choose
    $\big(\alpha_1^{(0)}, \alpha_1^{(1)}, \alpha_1^{(2)}, \gamma_1^{(0)}, \gamma_1^{(1)}, \gamma_1^{(2)}\big) \in \mathcal{R}^*_{MD}\big(((V_1, V_2), X_0, X_1, X_2), (\{X_0, X_1, X_2\})\big);$
    otherwise, choose $(\alpha_1^{(0)}, \alpha_1^{(1)}, \alpha_1^{(2)}) = (\beta_1^{(0)}, \beta_1^{(1)}, \beta_1^{(2)})$. Similarly, if
    $\big(\gamma_2^{(1)} - \beta_2^{(1)}\big) + \big(\gamma_2^{(2)} - \beta_2^{(2)}\big) < I(Y_1; Y_2),$
    choose
    $\big(\alpha_2^{(1)}, \alpha_2^{(2)}, \gamma_2^{(1)}, \gamma_2^{(2)}\big) \in \mathcal{R}^*_{MD}\big(((V_1, V_2), Y_1, Y_2), (\{Y_1, Y_2\})\big);$
    otherwise, choose $(\alpha_2^{(1)}, \alpha_2^{(2)}) = (\beta_2^{(1)}, \beta_2^{(2)})$;
  • The normalized average storage and retrieval rates satisfy
    $2t\bar{\alpha} \geq \alpha_1^{(0)} + \alpha_1^{(1)} + \alpha_1^{(2)} + \alpha_2^{(1)} + \alpha_2^{(2)}, \quad 4t\bar{\beta} \geq 2\beta_1^{(0)} + \beta_1^{(1)} + \beta_1^{(2)} + \beta_2^{(1)} + \beta_2^{(2)}.$
Then, we have the following theorem.
Theorem 1.
$\mathcal{R}_{in}(t) \subseteq \mathcal{R}$.
We can, in fact, potentially enlarge the achievable region by taking the union $\bigcup_{t=1}^{\infty} \mathcal{R}_{in}(t)$. However, unless $\mathcal{R}_{in}(t+1) \supseteq \mathcal{R}_{in}(t)$ holds for all $t \geq 1$, this union is even more difficult to characterize. Nevertheless, for each fixed t, we can identify inner bounds by specifying a feasible set of random variables $(X_0, X_1, X_2, Y_1, Y_2)$.
Instead of directly establishing this theorem, we shall prove the following theorem, which establishes the existence of a PIR code with diminishing error probability, and then use an expurgation technique to extract a zero-error PIR code.
Theorem 2.
Consider any $(\bar{\alpha}, \bar{\beta}) \in \mathcal{R}_{in}(t)$. For any $\epsilon > 0$ and sufficiently large L, there exists an $(L, L(\bar{\alpha} + \epsilon), L(\bar{\alpha} + \epsilon), L(\bar{\beta} + \epsilon), L(\bar{\beta} + \epsilon))$ ϵ-error PIR code with the query distributions given in (10) and (11).
The key observation in establishing this theorem is that there are five descriptions in this setting; however, retrieval and storage place different constraints on different combinations of the descriptions, and some descriptions can, in fact, be stored, recompressed, and then retrieved. Such compression and recompression may lead to storage savings. The description based on $X_0$ can be viewed as common information for $X_1$ and $X_2$, which allows us to trade off the storage and retrieval rates.
Proof of Theorem 2.
Codebook generation: Codebooks are built as the MD* codebooks associated with the distribution $((V_1, V_2), X_0, X_1, X_2, Y_1, Y_2)$.
Storage codes: The bin indices of the codebooks are stored in the two servers: those of X 0 , X 1 , and X 2 are stored at server-1 at rates α 1 ( 0 ) , α 1 ( 1 ) , and α 1 ( 2 ) , respectively; those of Y 1 and Y 2 are stored at server-2 at rates α 2 ( 1 ) and α 2 ( 2 ) . Note that at such rates, the codewords for X 0 , X 1 , and X 2 can be recovered jointly with overwhelming probability, while those for Y 1 and Y 2 can also be recovered jointly with overwhelming probability.
Retrieval codes: A different set of bin indices of the same codebooks is retrieved during the retrieval process, again based on the MD* codebooks: those of $X_0$, $X_1$, and $X_2$ are retrieved from server-1 at rates $\beta_1^{(0)}$, $\beta_1^{(1)}$, and $\beta_1^{(2)}$, respectively; those of $Y_1$ and $Y_2$ are retrieved from server-2 at rates $\beta_2^{(1)}$ and $\beta_2^{(2)}$. Note that at such rates, the codewords of $X_0$, $X_1$, and $Y_1$ can be jointly recovered, such that using the three corresponding codewords, the required $V_1$ source vector can be recovered with overwhelming probability. Similarly, the three other retrieval patterns $(X_0, X_1, Y_2) \to V_2$, $(X_0, X_2, Y_1) \to V_2$, and $(X_0, X_2, Y_2) \to V_1$ will succeed with overwhelming probability.
Storage and retrieval rates: The rates can be computed straightforwardly, after normalization by the parameter t. □
Next, we use Theorem 2 to prove Theorem 1.
Proof of Theorem 1.
Given an $\epsilon > 0$, according to Theorem 2, we can find an $(L, L(\bar{\alpha} + \epsilon), L(\bar{\alpha} + \epsilon), L(\bar{\beta} + \epsilon), L(\bar{\beta} + \epsilon))$ ϵ-error PIR code for some sufficiently large L. The probability of error of this code can be rewritten as
$P_e = 0.5 \sum_{k=1,2} \sum_{(w_1, w_2)} 2^{-2L} \, P_{Q_1^{[k]}, Q_2^{[k]}}\big(w_k \neq \hat{W}_k \mid (W_1, W_2) = (w_1, w_2)\big).$
For a fixed $(w_1, w_2)$ pair, denote by $E^{(1)}_{w_1, w_2}$ the event that the realized query pair $(q_1, q_2) \in \{(1, 1), (2, 2)\}$, i.e., $(Q_1^{[1]}, Q_2^{[1]}) = (q_1, q_2)$, leads to $\hat{w}_1 \neq w_1$, and by $E^{(2)}_{w_1, w_2}$ the event that the realized $(q_1, q_2) \in \{(1, 2), (2, 1)\}$ leads to $\hat{w}_2 \neq w_2$. Since $(Q_1^{[k]}, Q_2^{[k]})$ is independent of $(W_1, W_2)$, if $P(E^{(k)}_{w_1, w_2}) \neq 0$, we must have $P(E^{(k)}_{w_1, w_2}) \geq 0.5$. It follows that
$P_e \geq 0.25 \sum_{(w_1, w_2)} 2^{-2L} \, \mathbb{1}\big(E^{(1)}_{w_1, w_2} \cup E^{(2)}_{w_1, w_2}\big),$
where $\mathbb{1}(\cdot)$ is the indicator function. This implies that for any $\epsilon \leq 0.125$, there are at most $2^{2L-1}$ pairs $(w_1, w_2)$ that induce any coding error. We can use any $2^{2L-2}$ of the (at least $2^{2L-1}$) remaining error-free L-bit sequence pairs to instead store a pair of (L−1)-bit messages, through an arbitrary but fixed one-to-one mapping. This new code incurs a factor of $1 + 1/(L-1)$ increase in the normalized coding rates, which is negligible when L is large. Thus, a zero-error PIR code is found with asymptotically the same normalized rates as the ϵ-error code, and this completes the proof. □
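The expurgation step can be illustrated with a toy sketch (ours, with a synthetic set of erroneous message pairs standing in for the error events of the random code):

```python
import itertools, random

L = 4
pairs = list(itertools.product(range(2**L), repeat=2))

# With P_e <= 1/8, the counting bound above says at most half of all
# 2^{2L} message pairs can induce a coding error; here we plant the
# worst case of exactly half.
bad = set(random.sample(pairs, 2**(2*L - 1)))
good = [p for p in pairs if p not in bad]
assert len(good) >= 2**(2*L - 2)

# Fixed one-to-one mapping: a pair of (L-1)-bit messages is stored as a
# good L-bit pair, on which the original code makes no error at all.
encode = dict(zip(itertools.product(range(2**(L - 1)), repeat=2), good))
assert len(encode) == 2**(2*L - 2)
assert all(v not in bad for v in encode.values())
```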

4.2. Outer Bounds

We next turn our attention to the outer bounds for R , summarized in the following theorem.
Theorem 3.
Any $(\bar{\alpha}, \bar{\beta}) \in \mathcal{R}$ must satisfy
$\bar{\beta} \geq 0.75, \quad \bar{\alpha} + \bar{\beta} \geq 2, \quad 3\bar{\alpha} + 8\bar{\beta} \geq 10.$ (33)
Moreover, if $(\bar{\alpha}, \bar{\beta}) \in \mathcal{R}$ can be achieved by a linear code, it must satisfy
$\bar{\alpha} + 6\bar{\beta} \geq 6.$ (34)
The inequality $\bar{\beta} \geq 0.75$ follows from [11], while the other two bounds in (33) were proved in [24]. Therefore, we only need to prove (34).
Proof of Theorem 3.
Following [19], we make the following simplifying assumptions, which incur no loss of generality. Define $\mathcal{Q} = \{Q_1^{[1]}, Q_1^{[2]}, Q_2^{[1]}, Q_2^{[2]}\}$.
1. $Q_1^{[1]} = Q_1^{[2]}$ and $A_1^{[1]} = A_1^{[2]}$; (35)
2. $H(A_1^{[1]} \mid \mathcal{Q}) = H(A_2^{[1]} \mid \mathcal{Q}) = H(A_2^{[2]} \mid \mathcal{Q})$ and $H(S_1) = H(S_2)$, (36)
with
$H(A_1^{[1]} \mid \mathcal{Q}) \leq (\bar{\beta} + \epsilon)L, \quad H(S_2) \leq (\bar{\alpha} + \epsilon)L.$ (37)
Assumption 1 states that the query to the first database is the same regardless of the desired message index, which is justified by the privacy condition that the query to one database is independent of the desired message index. Assumption 2 states that the scheme is symmetric after the symmetrization operation of Lemma 1 (for the proof, we refer to Theorem 3 in [19]). Then, (37) follows from Definition 1 and the fact that the number of bits needed to describe $S_2$ or $A_1^{[1]}$ cannot be less than the corresponding entropy.
In the following, we use ( c ) to refer to the correctness condition, ( i ) to refer to the constraint that queries are independent of the messages, ( a ) to refer to the constraint that answers are deterministic functions of the storage variables and corresponding queries, and ( p ) to refer to the privacy condition.
Since $W_1$ can be decoded from $(A_1^{[1]}, A_2^{[1]}, \mathcal{Q})$, we have
$H(A_1^{[1]}, A_2^{[1]} \mid W_1, \mathcal{Q}) = H(A_1^{[1]}, A_2^{[1]}, W_1 \mid \mathcal{Q}) - H(W_1 \mid \mathcal{Q})$ (38)
$\overset{(c),(i)}{=} H(A_1^{[1]}, A_2^{[1]} \mid \mathcal{Q}) - L$ (39)
$\overset{(36)}{\leq} 2H(A_1^{[1]} \mid \mathcal{Q}) - L.$ (40)
Next, consider Ingleton’s inequality (which holds for linear codes):
$I(A_2^{[1]}; A_2^{[2]} \mid \mathcal{Q}) \leq I(A_2^{[1]}; A_2^{[2]} \mid W_1, \mathcal{Q}) + I(A_2^{[1]}; A_2^{[2]} \mid W_2, \mathcal{Q})$ (41)
$= 2 I(A_2^{[1]}; A_2^{[2]} \mid W_1, \mathcal{Q})$ (42)
$= 2\big[H(A_2^{[1]} \mid W_1, \mathcal{Q}) + H(A_2^{[2]} \mid W_1, \mathcal{Q}) - H(A_2^{[1]}, A_2^{[2]} \mid W_1, \mathcal{Q})\big]$ (43)
$\overset{(p)}{=} 2\big[2H(A_2^{[1]} \mid W_1, \mathcal{Q}) - H(A_2^{[1]}, A_2^{[2]} \mid W_1, \mathcal{Q})\big]$ (44)
$\leq 2\big[2H(A_2^{[1]} \mid W_1, \mathcal{Q}) + H(A_1^{[1]}, A_2^{[1]} \mid W_1, \mathcal{Q}) - H(A_1^{[1]}, A_2^{[1]}, A_2^{[2]} \mid W_1, \mathcal{Q}) - H(A_2^{[1]} \mid W_1, \mathcal{Q})\big]$ (45)
$\overset{(c),(35)}{=} 2\big[H(A_2^{[1]} \mid W_1, \mathcal{Q}) + H(A_1^{[1]}, A_2^{[1]} \mid W_1, \mathcal{Q}) - H(A_1^{[1]}, A_2^{[1]}, A_2^{[2]}, W_2 \mid W_1, \mathcal{Q})\big]$ (46)
$\overset{(i)}{\leq} 2\big[2H(A_1^{[1]}, A_2^{[1]} \mid W_1, \mathcal{Q}) - H(W_2)\big]$ (47)
$\overset{(40)}{\leq} 2\big[2\big(2H(A_1^{[1]} \mid \mathcal{Q}) - L\big) - L\big],$ (48)
where (42) follows from the observation that the second term in (41) can be bounded by the same method as the first, after switching the message indices; a more detailed derivation of (44) appears in (79) of [19]; and (45) is due to the sub-modularity of the entropy function.
Note that
$I(A_2^{[1]}; A_2^{[2]} \mid \mathcal{Q}) = H(A_2^{[1]} \mid \mathcal{Q}) + H(A_2^{[2]} \mid \mathcal{Q}) - H(A_2^{[1]}, A_2^{[2]} \mid \mathcal{Q})$ (49)
$\overset{(36)}{\geq} 2H(A_1^{[1]} \mid \mathcal{Q}) - (\bar{\alpha} + \epsilon)L,$ (50)
where in (50) the last term is bounded as follows:
$H(A_2^{[1]}, A_2^{[2]} \mid \mathcal{Q}) \leq H(A_2^{[1]}, A_2^{[2]}, S_2 \mid \mathcal{Q}) \overset{(a)}{=} H(S_2 \mid \mathcal{Q}) \overset{(37)}{\leq} (\bar{\alpha} + \epsilon)L.$ (51)
Combining (48) and (50), we have
$2H(A_1^{[1]} \mid \mathcal{Q})/L - (\bar{\alpha} + \epsilon) \leq 2\big(4H(A_1^{[1]} \mid \mathcal{Q})/L - 3\big),$
which gives
$\bar{\alpha} + \epsilon + 6H(A_1^{[1]} \mid \mathcal{Q})/L \geq 6,$
and since $H(A_1^{[1]} \mid \mathcal{Q}) \leq (\bar{\beta} + \epsilon)L$ by (37), letting $\epsilon \to 0$ yields
$\bar{\alpha} + 6\bar{\beta} \geq 6.$
The proof is complete. □
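As a sanity check (ours), the bounds of Theorem 3 can be evaluated at the operating points discussed so far; the non-linear point of Section 3 satisfies all the general outer bounds in (33) yet violates the linear-code bound (34):

```python
from math import log2

points = {
    "optimal linear":      (1.5, 0.75),
    "non-linear (Sec. 3)": (0.25 + 0.75 * log2(3), 0.75),  # ~1.4387
    "retrieve everything": (1.0, 1.0),
}
for name, (a, b) in points.items():
    general = b >= 0.75 and a + b >= 2 and 3 * a + 8 * b >= 10
    linear = a + 6 * b >= 6
    print(f"{name}: (33) holds: {general}, (34) holds: {linear}")
# The non-linear point gives alpha + 6*beta ~ 5.94 < 6: it violates (34)
# and is therefore unattainable by any linear code.
```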

4.3. Specialization of the Inner Bound

The inner bound given in Theorem 1 is general but rather involved, and we can specialize it in multiple ways in order to simplify it. One particularly interesting approach is as follows. Define the region $\tilde{\mathcal{R}}_{in}(t)$ to be the collection of $(\bar{\alpha}, \bar{\beta})$ pairs for which there exist random variables $(X_0, X_1, X_2, Y_1, Y_2)$ jointly distributed with $(V_1, V_2)$ such that the following hold:
  • The distribution factorizes as
    $P_{V_1, V_2, X_0, X_1, X_2, Y_1, Y_2} = P_{V_1, V_2} \, P_{X_0 \mid V_1, V_2} \, P_{X_1 \mid V_1, V_2} \, P_{X_2 \mid V_1, V_2} \, P_{Y_1 \mid V_1, V_2} \, P_{Y_2 \mid V_1, V_2};$
  • There exist deterministic functions $f_{1,1}$, $f_{1,2}$, $f_{2,1}$, and $f_{2,2}$ such that
    $V_1 = f_{1,1}(X_0, X_1, Y_1) = f_{2,2}(X_0, X_2, Y_2), \quad V_2 = f_{1,2}(X_0, X_1, Y_2) = f_{2,1}(X_0, X_2, Y_1);$
  • The rates are set as
    $\gamma_1^{(0)} = I(V_1, V_2; X_0), \quad \gamma_1^{(1)} = I(V_1, V_2; X_1), \quad \gamma_1^{(2)} = I(V_1, V_2; X_2),$ (56)
    $\gamma_2^{(1)} = I(V_1, V_2; Y_1), \quad \gamma_2^{(2)} = I(V_1, V_2; Y_2),$ (57)
    $\beta_1^{(0)} = \gamma_1^{(0)}, \quad \beta_1^{(1)} = I(V_1, V_2; X_1 \mid X_0), \quad \beta_1^{(2)} = I(V_1, V_2; X_2 \mid X_0),$ (58)
    $\beta_2^{(1)} = \max\big(I(V_1, V_2; Y_1 \mid X_0, X_1),\ I(V_1, V_2; Y_1 \mid X_0, X_2)\big),$ (59)
    $\beta_2^{(2)} = \max\big(I(V_1, V_2; Y_2 \mid X_0, X_1),\ I(V_1, V_2; Y_2 \mid X_0, X_2)\big),$ (60)
    with $(\alpha_1^{(0)} = \gamma_1^{(0)}, \alpha_1^{(1)}, \alpha_1^{(2)}, \alpha_2^{(1)}, \alpha_2^{(2)})$ chosen as in the third item of the general region $\mathcal{R}_{in}(t)$;
  • The normalized average storage and retrieval rates satisfy
    $2t\bar{\alpha} \geq \alpha_1^{(0)} + \alpha_1^{(1)} + \alpha_1^{(2)} + \alpha_2^{(1)} + \alpha_2^{(2)}, \quad 4t\bar{\beta} \geq 2\beta_1^{(0)} + \beta_1^{(1)} + \beta_1^{(2)} + \beta_2^{(1)} + \beta_2^{(2)}.$
Then we have the following corollary.
Corollary 1.
$\tilde{\mathcal{R}}_{in}(t) \subseteq \mathcal{R}$.
This inner bound is illustrated together with the outer bounds in Figure 2.
Proof. 
The main difference from Theorem 1 lies in the special dependence structure of $(X_0, X_1, X_2, Y_1, Y_2)$ jointly distributed with $(V_1, V_2)$, i.e., the conditional independence of the auxiliary random variables given $(V_1, V_2)$. We verify that the rate assignments satisfy all the constraints in Theorem 1. Due to this structure, it is straightforward to verify that
$\big(\gamma_1^{(0)}, \gamma_1^{(1)}, \gamma_1^{(2)}, \gamma_2^{(1)}, \gamma_2^{(2)}\big) \in \mathcal{R}_{MD}\big((V_1, V_2), X_0, X_1, X_2, Y_1, Y_2\big).$
We next verify that the MD* membership condition (24) holds with the choices given above. Due to the symmetry of the structure, we only need to confirm it for one subset of random variables, $\{X_0, X_1, Y_1\}$; the other three subsets $\{X_0, X_1, Y_2\}$, $\{X_0, X_2, Y_1\}$, and $\{X_0, X_2, Y_2\}$ follow similarly. There are in total seven conditions of the form (14) associated with the subset $\{X_0, X_1, Y_1\}$. Notice that
$\gamma_1^{(0)} - \beta_1^{(0)} = 0, \quad \gamma_1^{(1)} - \beta_1^{(1)} = I(X_1; X_0), \quad \gamma_2^{(1)} - \beta_2^{(1)} \leq I(Y_1; X_0, X_1),$
which in fact confirms three of the seven conditions, namely those where J is a singleton. Next, when J has two elements, we verify that
$\big(\gamma_1^{(0)} - \beta_1^{(0)}\big) + \big(\gamma_1^{(1)} - \beta_1^{(1)}\big) = I(X_1; X_0) = H(X_0) + H(X_1) - H(X_0, X_1) \leq H(X_0) + H(X_1) - H(X_0, X_1 \mid Y_1),$
$\big(\gamma_1^{(0)} - \beta_1^{(0)}\big) + \big(\gamma_2^{(1)} - \beta_2^{(1)}\big) \leq I(Y_1; X_0, X_1) = H(Y_1) + H(X_0, X_1) - H(X_0, X_1, Y_1) \leq H(Y_1) + H(X_0) + H(X_1) - H(X_0, X_1, Y_1) = H(X_0) + H(Y_1) - H(X_0, Y_1 \mid X_1),$
$\big(\gamma_1^{(1)} - \beta_1^{(1)}\big) + \big(\gamma_2^{(1)} - \beta_2^{(1)}\big) \leq I(X_1; X_0) + I(Y_1; X_0, X_1) = H(X_1) + H(Y_1) - H(X_1, Y_1 \mid X_0).$
Finally, when J contains all three elements, we have
$\big(\gamma_1^{(0)} - \beta_1^{(0)}\big) + \big(\gamma_1^{(1)} - \beta_1^{(1)}\big) + \big(\gamma_2^{(1)} - \beta_2^{(1)}\big)$
$= I(X_0; X_1) + I(V_1, V_2; Y_1) - \max\big(I(V_1, V_2; Y_1 \mid X_0, X_1),\ I(V_1, V_2; Y_1 \mid X_0, X_2)\big)$
$\leq I(X_0; X_1) + I(V_1, V_2; Y_1) - I(V_1, V_2; Y_1 \mid X_0, X_1)$
$= H(X_0) + H(X_1) + H(Y_1) - H(X_0, X_1, Y_1).$
Thus, (24) indeed holds with the assignments (56)–(60), which completes the proof. □
We can use any explicit distribution of $(X_0, X_1, X_2, Y_1, Y_2)$ to obtain an explicit inner bound from $\tilde{\mathcal{R}}_{in}(t)$, and the next corollary provides one such non-trivial bound. For convenience, we write the entropy of a probability vector $(p_1, \ldots, p_t)$ as $H(p_1, \ldots, p_t)$.
Corollary 2.
The following $(\bar{\alpha}, \bar{\beta}) \in \mathcal{R}$ for any $p \in [0, 1]$:
$\bar{\alpha} = \frac{9}{4} - H\!\left(\frac{1}{4}, \frac{3}{4}\right) + \frac{1}{4} H\!\left(\frac{1-p}{2}, \frac{1-p}{2}, \frac{p}{2}, \frac{p}{2}\right) + \frac{1}{2} H\!\left(\frac{2-p}{4}, \frac{2-p}{4}, \frac{p}{2}\right) - \frac{3}{4} H\!\left(\frac{3-2p}{6}, \frac{3-2p}{6}, \frac{p}{3}, \frac{p}{3}\right),$
$\bar{\beta} = \frac{5}{8} + \frac{1}{4} H\!\left(\frac{2-p}{4}, \frac{2-p}{4}, \frac{p}{2}\right) - \frac{1}{8} H\!\left(\frac{1-p}{2}, \frac{1-p}{2}, p\right).$
Proof. 
These trade-off pairs are obtained by applying Corollary 1 with t = 1, setting $(X_1, X_2, Y_1, Y_2)$ as given in (17), and letting $X_0$ be defined as in Table 1. Note that the joint distribution indeed satisfies the required Markov structure, and in this case $\alpha_2^{(1)} = \beta_2^{(1)}$ and $\alpha_2^{(2)} = \beta_2^{(2)}$. □
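The trade-off curve of Corollary 2 is easy to evaluate; the sketch below (ours) reproduces the two end points, with p = 0 giving the Slepian–Wolf point of Section 3 (approximately (1.4387, 0.75)) and p = 1 giving the retrieve-everything point (1, 1):

```python
from math import log2

def H(*p):
    # Entropy of a probability vector; zero entries contribute nothing.
    return -sum(x * log2(x) for x in p if x > 0)

def corollary2(p):
    # The (alpha_bar, beta_bar) pair of Corollary 2 as a function of p.
    a = (9/4 - H(1/4, 3/4)
         + 1/4 * H((1-p)/2, (1-p)/2, p/2, p/2)
         + 1/2 * H((2-p)/4, (2-p)/4, p/2)
         - 3/4 * H((3-2*p)/6, (3-2*p)/6, p/3, p/3))
    b = 5/8 + 1/4 * H((2-p)/4, (2-p)/4, p/2) - 1/8 * H((1-p)/2, (1-p)/2, p)
    return a, b

# Intermediate p traces a curve below the space-sharing line between
# the two extreme points.
for p in (0, 0.25, 0.5, 0.75, 1):
    a, b = corollary2(p)
    print(f"p={p:.2f}: alpha={a:.4f}, beta={b:.4f}")
```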

5. Conclusions

We consider the problem of private information retrieval using a Shannon-theoretic approach. A new coding scheme based on random coding and binning is proposed, which reveals a hidden connection to the multiple description problem. It is shown that for the (2, 2) PIR setting, this non-linear coding scheme provides the best known trade-off between the retrieval rate and the storage rate, one that is strictly better than what is achievable using linear codes. We further investigate the relation between zero-error and ϵ-error PIR codes in this setting, and show that the distinction does not cause any essential difference in this problem. We hope that the connection to multiple description coding can provide a new avenue to design more efficient PIR codes.

Author Contributions

Conceptualization, C.T., H.S. and J.C.; Methodology, C.T., H.S. and J.C.; Investigation, C.T., H.S. and J.C.; Writing—original draft, C.T.; Writing—review and editing, C.T., H.S. and J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Science Foundation through grants CCF-18-16518, CCF-18-16546, CCF-20-07067, CCF-20-07108, and CCF-20-45656.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chor, B.; Goldreich, O.; Kushilevitz, E.; Sudan, M. Private information retrieval. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, Milwaukee, WI, USA, 23–25 October 1995; pp. 41–50.
  2. Shah, N.; Rashmi, K.; Ramchandran, K. One extra bit of download ensures perfectly private information retrieval. In Proceedings of the 2014 IEEE International Symposium on Information Theory (ISIT), Honolulu, HI, USA, 29 June–4 July 2014; pp. 856–860.
  3. Fazeli, A.; Vardy, A.; Yaakobi, E. Codes for distributed PIR with low storage overhead. In Proceedings of the 2015 IEEE International Symposium on Information Theory (ISIT), Hong Kong, China, 14–19 June 2015; pp. 2852–2856.
  4. Rao, S.; Vardy, A. Lower bound on the redundancy of PIR codes. arXiv 2016, arXiv:1605.01869.
  5. Blackburn, S.R.; Etzion, T. PIR array codes with optimal virtual server rate. IEEE Trans. Inf. Theory 2019, 65, 6136–6145.
  6. Blackburn, S.R.; Etzion, T.; Paterson, M.B. PIR schemes with small download complexity and low storage requirements. IEEE Trans. Inf. Theory 2019, 66, 557–571.
  7. Zhang, Y.; Wang, X.; Wei, H.; Ge, G. On private information retrieval array codes. IEEE Trans. Inf. Theory 2019, 65, 5565–5573.
  8. Vajha, M.; Ramkumar, V.; Kumar, P.V. Binary, shortened projective Reed Muller codes for coded private information retrieval. In Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017; pp. 2648–2652.
  9. Asi, H.; Yaakobi, E. Nearly optimal constructions of PIR and batch codes. IEEE Trans. Inf. Theory 2018, 65, 947–964.
  10. Chan, T.H.; Ho, S.W.; Yamamoto, H. Private information retrieval for coded storage. In Proceedings of the 2015 IEEE International Symposium on Information Theory (ISIT), Hong Kong, China, 14–19 June 2015; pp. 2842–2846.
  11. Sun, H.; Jafar, S.A. The capacity of private information retrieval. IEEE Trans. Inf. Theory 2017, 63, 4075–4088.
  12. Tajeddine, R.; Gnilke, O.W.; El Rouayheb, S. Private information retrieval from MDS coded data in distributed storage systems. IEEE Trans. Inf. Theory 2018, 64, 7081–7093.
  13. Banawan, K.; Ulukus, S. The capacity of private information retrieval from coded databases. IEEE Trans. Inf. Theory 2018, 64, 1945–1956.
  14. Tian, C.; Sun, H.; Chen, J. Capacity-achieving private information retrieval codes with optimal message size and upload cost. IEEE Trans. Inf. Theory 2019, 65, 7613–7627.
  15. Zhou, R.; Tian, C.; Sun, H.; Liu, T. Capacity-achieving private information retrieval codes from MDS-coded databases with minimum message size. IEEE Trans. Inf. Theory 2020, 66, 4904–4916.
  16. Sun, H.; Jafar, S.A. The capacity of robust private information retrieval with colluding databases. IEEE Trans. Inf. Theory 2018, 64, 2361–2370.
  17. Ulukus, S.; Avestimehr, S.; Gastpar, M.; Jafar, S.; Tandon, R.; Tian, C. Private retrieval, computing and learning: Recent progress and future challenges. IEEE J. Sel. Areas Commun. 2022, 40, 729–748.
  18. Attia, M.A.; Kumar, D.; Tandon, R. The capacity of private information retrieval from uncoded storage constrained databases. IEEE Trans. Inf. Theory 2020, 66, 6617–6634.
  19. Sun, H.; Jafar, S.A. Multiround private information retrieval: Capacity and storage overhead. IEEE Trans. Inf. Theory 2018, 64, 5743–5754.
  20. Sun, H.; Tian, C. Breaking the MDS-PIR capacity barrier via joint storage coding. Information 2019, 10, 265.
  21. Guo, T.; Zhou, R.; Tian, C. New results on the storage-retrieval tradeoff in private information retrieval systems. IEEE J. Sel. Areas Inf. Theory 2021, 2, 403–414.
  22. Tian, C.; Sun, H.; Chen, J. A Shannon-theoretic approach to the storage-retrieval tradeoff in PIR systems. In Proceedings of the 2018 IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA, 17–20 June 2018; pp. 1904–1908.
  23. El Gamal, A.; Cover, T. Achievable rates for multiple descriptions. IEEE Trans. Inf. Theory 1982, 28, 851–857.
  24. Tian, C. On the storage cost of private information retrieval. IEEE Trans. Inf. Theory 2020, 66, 7539–7549.
  25. Venkataramani, R.; Kramer, G.; Goyal, V.K. Multiple description coding with many channels. IEEE Trans. Inf. Theory 2003, 49, 2106–2114.
  26. Wyner, A.; Ziv, J. The rate-distortion function for source coding with side information at the decoder. IEEE Trans. Inf. Theory 1976, 22, 1–10.
  27. Pradhan, S.S.; Puri, R.; Ramchandran, K. n-channel symmetric multiple descriptions-part I: (n, k) source-channel erasure codes. IEEE Trans. Inf. Theory 2004, 50, 47–61.
  28. Tian, C.; Chen, J. New coding schemes for the symmetric K-description problem. IEEE Trans. Inf. Theory 2010, 56, 5344–5365.
  29. Sgarro, A. Source coding with side information at several decoders. IEEE Trans. Inf. Theory 1977, 23, 179–182.
Figure 1. A possible coding structure.
Figure 2. Illustration of inner bounds and outer bounds.
Table 1. Conditional distribution $P_{X_0 \mid W_1, W_2}$ used in Corollary 2.

(w1, w2)    x0 = (00)    x0 = (01)    x0 = (10)    x0 = (11)
(00)        1/2          0            0            1/2
(10)        (1 - p)/2    p            0            (1 - p)/2
(01)        (1 - p)/2    0            p            (1 - p)/2
(11)        1/2          0            0            1/2