3.1. Motivations
Organizing the search range of CkQST. The first issue of using a quadtree to organize the search range of CkQST is how to map the search range to the quadtree nodes. Given that $\forall q\in \mathcal{Q}$, $q.S{R}_{k}$ can be mapped to any set of quadtree nodes $NS=\{{n}_{1},{n}_{2},\cdots ,{n}_{i}\}$, only if the union of the spatial region corresponding to these nodes in $NS$ covers $q.S{R}_{k}$, which is also called that $q$ is associated with $NS$. Associating a query with the quadtree nodes is challenging because it affects two computation costs: (1) Verification cost, i.e., the cost of verifying the query with the objects falling in the associated nodes. (2) Update cost, i.e., the cost of inserting or deleting the query in or from the associated nodes. If the search range is organized by nodes with large regions, the index update cost is small, and the verification cost is large; otherwise, if multiple nodes with small regions are used, the situation is reversed. Therefore, a cost model is required to trade off the verification cost and update cost, and find the optimal associated nodes for CkQST.
Organizing the keywords of CkQST. When new objects arrive, the cost of verifying these objects with queries in spatial nodes is expensive. How to reduce the verification cost is the key to improve the filtering ability of the index. We discuss three aspects of constructing an inverted index. (1) For an inverted index, queries in posting lists are usually unordered. For the five queries in
Figure 1, we attached them to the posting list of a single keyword.
Figure 3a is the inverted index in which queries are attached to the posting list corresponding to the first query keyword, and
Figure 3b is the rankedkey inverted index in which queries are attached to the posting list corresponding to the least frequent keyword. If the incoming objects contain the corresponding term, all queries in posting lists are verified [
1,
2], which is inefficient. In this work, we use an ordered, inverted index to solve the above problem.
Figure 3c is the ordered, inverted index if queries are attached to the posting list corresponding to the first keyword, i.e., queries in posting lists are organized in the ascending order according to the keywords. When
${o}_{6}$ with keywords
$\{{w}_{1},{w}_{2},{w}_{3}\}$ arrives, the posting list corresponding to
${w}_{1}$ is verified. When
${o}_{6}$ is verified with
${q}_{2}$, its keywords are smaller than
${q}_{2}$, so we can terminate the verification early and speed up processing objects.
(2) Compared with the ordered, inverted index constructed by single keyword, the ordered, inverted index constructed by multiple keywords has more advantages. The length is shorter and the verification probability is smaller. As
Figure 3d shows,
${o}_{6}$ is verified with the first two posting lists and contains all the keywords of the queries in these posting lists. However, the number of posting lists might grow sharply. If the number of terms contained in vocabulary set
$\mathcal{V}$ is 1 M, and the number of query keywords are not more than 5, total 10
^{6×5} posting lists are required, which is difficult to implement by hash table due to the need for large continuous memory. Like other works in [
1,
2,
3,
4,
5,
6,
7,
8,
9,
10,
11,
12], this paper uses the Map class in Microsoft Visual Studio [
19] to build the ordered, inverted index. Lemma 1 describes the verification efficiency with the number of keywords for constructing the ordered, inverted index.
Section 4.2 verifies this lemma through experiments. Based on the discussions, we select two keywords to construct the ordered, inverted index.
(3) Usually, there are many queries in posting lists, but only a small number match the incoming objects. Therefore, quickly locating the queries to be verified in posting lists is another way to improve the efficiency of evaluating CkQST. The queries in posting list are partitioned into multiple blocks such that objects are verified with the queries in a few blocks rather than the whole posting list. The only problem is how to partition these queries in posting lists. It is inefficient to have too many or too few queries in a block. An adaptive insertion strategy is proposed in
Section 3.3.
3.2. Ordered, Inverted Index
The formal definition of an ordered, inverted index constructed using two keywords follows. Queries are attached to the posting list of their first two keywords, and arranged in ascending order according to their keywords.
Definition 1 (Ordered Posting List/Ordered, Inverted Index). Given a set of queries ${q}_{1},{q}_{2},\cdots ,{q}_{i}$ to be inserted into a quadtree node, if ${q}_{1}.\psi [1:2]={q}_{2}.\psi [1:2]\cdots ={q}_{i}.\psi [1:2]=\{{w}_{i1},{w}_{i2}\}$ , and ${q}_{1}.\psi [3:]\le {q}_{2}.\psi [3:]\le \cdots \le {q}_{i}.\psi [3:]$, the posting list determined by the two terms ${w}_{i1},{w}_{i2}$ at the node is denoted as $P{L}_{{w}_{i1}{w}_{i2}}$, in which these queries are successively inserted. $P{L}_{{w}_{i1}{w}_{i2}}$is called an ordered posting list. Specifically, if these queries only contain one keyword, the corresponding posting list is denoted as$P{L}_{{w}_{i1}{w}_{i1}}$. All the ordered posting lists constitute the ordered, inverted index.
Hereafter the ordered posting list is abbreviated as posting list, if there is no ambiguity. To quickly locate the queries to be verified, posting lists are divided into multiple blocks.
Definition 2 (Block). Given any ordered posting list $P{L}_{{w}_{i1}{w}_{i2}}$,${w}_{i1}\ne {w}_{i2}$,${b}_{r}$ is the ${r}^{th}$ block of $P{L}_{{w}_{i1}{w}_{i2}}$. For any query $q$in${b}_{r}$,$q.\psi [3]\in [{b}_{r}.minw,{b}_{r}.maxw]$, where ${b}_{r}.minw={\mathrm{min}}_{{q}_{i}\in {b}_{r}}{q}_{i}.\psi [3]$, ${b}_{r}.maxw={\mathrm{max}}_{{q}_{i}\in {b}_{r}}{q}_{i}.\psi [3]$. ${b}_{r}.\psi $ denotes all the keywords satisfying ${b}_{r}.\psi =\left\{w\rightw\in [{b}_{r}.minw,{b}_{r}.maxw]\}$. Specially, if $q$ only contains one or two keywords, it is inserted into the block ${b}_{0}$ of the corresponding posting list.
Lemma 1. If the number of keywords for constructing an ordered, inverted index is
$m\left(m\ge 1\right)$, there are at most${\left\mathcal{V}\right}^{m}$posting lists at a node. For any object $o$ containing more than two keywords, the verification cost can be estimated by Equation (1). Where$\left\mathcal{B}\right$ is the number of blocks in a posting list, $\leftb\right$ is the number of queries in $b$, $Q$ contains queries whose keywords are contained in $o$, and the number of keywords is less than $m$.
The proof is shown in Appendix A. For
$\forall {w}_{j}\in b.\psi $, the verification probability, denoted as
${p}_{V}^{w}({w}_{j})$, is maintained, i.e., the probability of verifying these queries subjected to
${q}_{i}.\psi [3]=\left\{{w}_{j}\right\}$ in
${b}_{r}$. For
$\forall {b}_{r}$, the verifying probability, denoted as
${p}_{V}^{B}({b}_{r})$, is maintained, i.e., the probability that the block
${b}_{r}$ is verified, which can be estimated by Equation (2).
The following theorems claim that the incoming object is verified with few queries in posting lists. For any incoming object, the blocks being verified can be located according to block keyword interval $[{b}_{r}.minw,{b}_{r}.maxw]$ (see Theorem 1). The object verification with a block or a posting list can be terminated if the keywords are smaller than that of some query being verified. (see Theorem 2).
Theorem 1. Given$\forall o\in \mathcal{O}$and a posting list$P{L}_{{w}_{i1}{w}_{i2}}$being verified with$o$, for$\forall {b}_{r}$in$P{L}_{{w}_{i1}{w}_{i2}}$, if$\exists q$in${b}_{r}$, $o$contains all keywords of$q$, only if$\exists {w}_{j}\in o.\psi $, $\text{}{w}_{j}\in [{b}_{r}.minw,{b}_{r}.maxw]$.
Proof. Suppose that $o$ contains all the keywords of $q$, but there is no term in $o.\psi $ satisfying ${w}_{j}\in [{b}_{r}.minw,{b}_{r}.maxw]$. Based on Definition 2, $\text{}q.\psi [3]\in [{b}_{r}.minw,{b}_{r}.maxw]$, $o$ does not contain the third keyword of $q$, so $o$ does not contain all the keywords of $q$. It is contradicted by the hypothesis, so the theorem is proved. □
Theorem 2. Given$\forall o\in \mathcal{O}$and a posting list$P{L}_{{w}_{i1}{w}_{i2}}$being verified with$o$, for$\forall {b}_{r}$in$P{L}_{{w}_{i1}{w}_{i2}}$, we have the following conclusions. (1) If${b}_{r}.minw>\mathrm{max}\left\{w\in o.\psi \right\}$, $o$is not the result of queries in the blocks starting from${b}_{r}$, and$P{L}_{{w}_{i1}{w}_{i2}}$can be terminated verifying. Specifically, for$\forall {q}_{i}$ in ${b}_{r}$, if ${q}_{i}.\psi [3]>\mathrm{max}\left\{w\in o.\psi \right\}$, $P{L}_{{w}_{i1}{w}_{i2}}$ can be terminated verifying. (2) If ${b}_{r}.maxw<\mathrm{min}\left\{w\in o.\psi w>{w}_{i2}\right\}$, $o$ is not the result of queries in ${b}_{r}$.
Proof. (1) If ${b}_{r}.minw={\mathrm{min}}_{{q}_{i}\in {b}_{r}}{q}_{i}.\psi [3]>\mathrm{max}\{w\in o.\psi \}$, i.e., for any ${q}_{i}$ in ${b}_{r}$, $o$ does not contain its third keyword, $o$ is not the result of ${q}_{i}$. So does the block ${b}_{r+j}$, since ${b}_{r+j}.minw>{b}_{r}.minw>\mathrm{max}\{w\in o.\psi \}$. Similarly, for any ${q}_{i}$ in ${b}_{r}$, ${q}_{i+j}.\psi [3]>{q}_{i}.\psi [3]>\mathrm{max}\{w\in o.\psi \}$, $o$ is not the result of queries after ${q}_{i}$.
(2) If ${b}_{r}.maxw={\mathrm{max}}_{{q}_{i}\in {b}_{r}}{q}_{i}.\psi [3]<\mathrm{min}\left\{w\in o.\psi w>{w}_{i2}\right\}$, $o$ is not the result of the queries in ${b}_{r}$. □
3.3. Adaptive Query Insertion Algorithm
Given any posting list at a node, we consider two extreme situations: (1) the posting list only contains a block which contains all queries; (2) the posting list contains many blocks, each of which only contain one query. The former has poor filtering ability and the latter has high update cost. Neither is what we expect. In the real world, people are concerned with different interests and often pay high attention to the breaking news or topical issues, so the keywords of the queries and objects vary over time. For each query, we adaptively insert it into the posting lists according to the historical queries and objects. We expect that the increase of the verification cost and update cost of the posting list is minimal after the query being inserted.
Given a posting list
$P{L}_{{w}_{i1}{w}_{i2}}$, the update cost is denoted as
${C}_{U}^{PL}(P{L}_{{w}_{i1}{w}_{i2}})$, and the verification cost is denoted as
${C}_{V}^{PL}(P{L}_{{w}_{i1}{w}_{i2}})$, which can be estimated by Equation (3), where
${C}_{V}^{B}({b}_{r})={p}_{V}^{B}({b}_{r})\ast (log\left\mathcal{B}\right+\left{b}_{r}\right)$ represents the verification cost of the block
${b}_{r}$ in
$P{L}_{{w}_{i1}{w}_{i2}}$.
Theorem 3. Let$q$be the query to be inserted into$P{L}_{{w}_{i1}{w}_{i2}}$, $q.\psi [1:3]=\{{w}_{i1},{w}_{i2},{w}_{j}\}$, $P{L}_{{w}_{i1}{w}_{i2}}$has$\left\mathcal{B}\right(\left\mathcal{B}\right\ge 1)$blocks, $\Delta {C}_{V}^{PL}$ is the increase of verification cost of$P{L}_{{w}_{i1}{w}_{i2}}$after$q$being inserted. We have the following conclusions.
Case 1: If $\exists {b}_{r}\in P{L}_{{w}_{i1}{w}_{i2}}$ satisfies ${w}_{j}\in \left[{b}_{r}.minw,{b}_{r}.maxw\right]$, $q$ is inserted into ${b}_{r}$. $\Delta {C}_{V}^{PL}={p}_{V}^{B}({b}_{r})$.
Case 2: If $\nexists b\in P{L}_{{w}_{i1}{w}_{i2}}$ satisfies ${w}_{j}\in [b.minw,b.maxw]$, but $\exists {b}_{r}\in P{L}_{{w}_{i1}{w}_{i2}}$ satisfies ${b}_{r}.maxw<{w}_{j}$, $q$ is inserted into the tail of ${b}_{r}$. The updated block is denoted as ${b}_{r}{}^{\prime}$. $\Delta {C}_{V}^{PL}=({p}_{V}^{B}({b}_{r}{}^{\prime}){p}_{V}^{B}({b}_{r}))\ast (log\left\mathcal{B}\right+\left{b}_{r}\right)+{p}_{V}^{B}({b}_{r}{}^{\prime})$.
Case 3: Similar to case 2, if $\nexists b\in P{L}_{{w}_{i1}{w}_{i2}}$ satisfies ${w}_{j}\in [b.minw,b.maxw]$, but $\exists {b}_{r}\in P{L}_{{w}_{i1}{w}_{i2}}$ satisfies ${w}_{j}<{b}_{r}.minw$, $q$ is inserted into the head of the ${b}_{r}$. $\Delta {C}_{V}^{PL}=({p}_{V}^{B}({b}_{r}{}^{\prime}){p}_{V}^{B}({b}_{r}))\ast (log\left\mathcal{B}\right+\left{b}_{r}\right)+{p}_{V}^{B}({b}_{r}{}^{\prime})$.
Case 4: If $\nexists b\in P{L}_{{w}_{i1}{w}_{i2}}$ satisfies ${w}_{j}\in [b.minw,b.maxw]$, a new block $b$ is constructed in $P{L}_{{w}_{i1}{w}_{i2}}$, and $q$ is inserted into $b$.$\Delta {C}_{V}^{PL}=\mathrm{log}((\left\mathcal{B}\right+1)/\left\mathcal{B}\right)\ast {\sum}_{{b}_{r}}{p}_{V}^{B}({b}_{r})+{p}_{V}^{w}({w}_{j})\ast \mathrm{log}(\left\mathcal{B}\right+1)$.
Proof. (1) Case 1: If
$q$ is inserted into
${b}_{r}$, the verifying probability
${p}^{B}({b}_{r})$ does not change since
${w}_{j}\in \left[{b}_{r}.minw,{b}_{r}.maxw\right]$, but the number of queries in
${b}_{r}$ increases by 1.
(3) Similar to case 2.
(4) Case 4: For any
${b}_{i}$ in
$P{L}_{{w}_{i1}{w}_{i2}}$,
${C}_{V}^{B}\left({b}_{i}\right)={p}_{V}^{B}\left({b}_{i}\right)\ast \left(\mathrm{log}\left\mathcal{B}\right+\left{b}_{i}\right\right)$,
${C}_{V}^{B}{}^{\prime}\left({b}_{i}\right)={p}_{V}^{B}\left({b}_{i}\right)\ast \left(\mathrm{log}\left(\left\mathcal{B}\right+1\right)+\left{b}_{i}\right\right)$Theorem 3 shows all the cases where the verification costs increase if a query is inserted into the posting list. The increase of update costs $\Delta {C}_{U}^{PL}$ corresponding to the above four cases are $O(\left{b}_{r}\right+1)$, $O(1)$, $O(1)$, and $O(1+log\left\mathcal{B}\right)$ respectively. To compare the verification costs and update costs, we introduce a normalization parameter ${\theta}_{U}(0<{\theta}_{U}\le 1)$ to represent the ratio of the update operation to the verification operation, i.e., if a query is inserted into a node, there will be $1/{\theta}_{U}$ objects being verified with it. A query is adaptively inserted into the posting list according to the following theorem. □
Theorem 4. Let$q$ be the query to be inserted into$P{L}_{{w}_{i1}{w}_{i2}}$, $q.\psi [1:3]=\{{w}_{i1},{w}_{i2},{w}_{j}\}$. If$\exists {b}_{r}$in$P{L}_{{w}_{i1}{w}_{i2}}$ satisfies ${w}_{j}\in [{b}_{r}.minw,{b}_{r}.maxw]$, $q$is inserted into${b}_{r}$. Otherwise, the minimum$\Delta {C}_{V}^{PL}+{\theta}_{U}\Delta {C}_{U}^{PL}$of the cases 2–4 is taken.
Algorithm 1: $InsertQueryPL(q,P{L}_{q.\psi [1]q.\psi [2]})$ 
1 if $\nexists P{L}_{q.\psi [1]q.\psi [2]}$ then 
2 construct new block b in $P{L}_{q.\psi [1]q.\psi [2]}$insert q into b; return;

3 ${b}_{r}\leftarrow \mathrm{min}\left\{b\rightb.minw\ge q.\psi [3]\}$;

4 if $q.\psi [3]=={b}_{r}.minw$ then q is inserted into b_{r}; return;

5 if $r>1$and $q.\psi [3]\le {b}_{r1}.\mathrm{maxw}$ then

6 q is inserted into b_{r−1}; return;

7 if $\nexists {b}_{r}$ then compute $\mathrm{min}{\{\Delta {C}_{V}^{PL}+{\theta}_{U}\Delta {C}_{U}^{PL}\}}_{case2,case4}$ on ${b}_{\left\mathcal{B}\right}$;

8 if r == 1 then compute $\mathrm{min}{\{\Delta {C}_{V}^{PL}+{\theta}_{U}\Delta {C}_{U}^{PL}\}}_{\mathrm{case}3,case4}$;

9 if r > 1 then compute $\mathrm{min}{\{\Delta {C}_{V}^{PL}+{\theta}_{U}\Delta {C}_{U}^{PL}\}}_{\text{case2,case3,case}4}$;

10 if case 2 then q is inserted into the tail of b_{r−1};

11 if case 3 then q is inserted into the head of b_{r};

12 if case 4 inserted into a new block.

Given $\forall q\in \mathcal{Q}$, if $q$ contains no more than two keywords, it is directly inserted into the ${b}_{0}$ of the corresponding posting list. Algorithm 1 shows how a query $q$ containing more than two keywords is adaptively inserted into a posting list. If $P{L}_{q.\psi [1]q.\psi [2]}$ does not exist, a new block $b$ is constructed, and $q$ is inserted into $b$ (lines 1–2). Otherwise, a block in $P{L}_{q.\psi [1]q.\psi [2]}$ is found for $q$ to minimize $\Delta {C}_{V}^{PL}+{\theta}_{U}\Delta {C}_{U}^{PL}$ (lines 3–12). First, we find the block, denoted as ${b}_{r}$, whose ${b}_{r}.minw$ is the smallest—no smaller than $q.\psi [3]$ (line 3). If $q.\psi [3]=={b}_{r}.minw$, $q$ is inserted into ${b}_{r}$ (line 4). If $q.\psi [3]\in [{b}_{r1}.minw,{b}_{r1}.maxw]$ ($r>1$), $q$ is inserted into block ${b}_{r1}$ (lines 5–6). Otherwise, we compute $\Delta {C}_{V}^{PL}+{\theta}_{U}\Delta {C}_{U}^{PL}$ according to cases 2–4 in Theorem 3 and select the minimum case (lines 7–12). It is worth noting that when compared with the first block of the list, there are only cases 3–4, and if $q.\psi [3]$ is larger than ${b}_{r}.minw$ of all the blocks, there are only cases 2 and 4.
Computation complexity. In the worst case, the computation cost of Algorithm 1 is shown as Lemma 1. That is, in posting lists constructed by two keywords, the complexity of inserting a query at a node is $\mathrm{O}({\lefto.\psi \right}^{2}\mathrm{log}{\left\mathcal{V}\right}^{2}+{\lefto.\psi \right}^{3}(\mathrm{log}\left\mathcal{B}\right+\leftb\right/\left\mathcal{B}\right\leftb.\psi \right\left)\right)$. The algorithm can adaptively adjust $\left\mathcal{B}\right$ and $\leftb\right$.
3.4. The MemoryBased Cost Model
A memorybased cost model associates queries with the optimal quadtree nodes. Given the search range of CkQST, the model traversals the quadtree from the root node, compares the sum of the verification cost and index update cost if the query is associated with the current node and its child nodes, and selects the smaller one. The verification cost is the product of the number of verified objects and the expectation of the verification cost, and the update cost is the expectation of the update cost if the query is inserted into the corresponding block of the posting list.
Definition 3 (Minimum Bounding Node). Given
$\forall q\in \mathcal{Q}$ and search range $q.S{R}_{k}$, if $\exists N$,$q.S{R}_{k}\subseteq N.R$, and for any child node ${n}_{i}$ of $N$,$q.S{R}_{k}\u2288{n}_{i}.R$, $N$ is the minimum bounding node of $q$ , where $N.R$, $n.R$ are the region where $N$ and $n$ locate respectively. The minimum bounding node of $q$ is the node, which covers its search range, but any of its child nodes cannot completely cover the search range.
Verification cost. Given
$\forall q\in \mathcal{Q}$ and its minimum bounding node
$N$,
$q.\psi [1:3]=\{{w}_{i1},{w}_{i2},{w}_{i3}\}$, if
$q$ is associated with
$N$, the verification cost within unit time interval, denoted as
${C}_{V}^{q}\left(q,N\right)$, can be estimated by Equation (4). We assume that the query and object contain more than two keywords, and the average verification cost is unit time.
Specifically, if the query or the object contains one or two keywords, the verification cost is estimated by Equation (5). This case is simple, so we omit the details.
where
$Nu{m}_{o}^{N}(N)$ is the number of objects falling in
$N$ within the unit time interval.
${p}_{V}^{q}\left(qN\right)$ is the probability that
$q$ is verified if it is inserted into
${b}_{r}$ in
$N$, i.e., the probability that the objects contain the terms
${w}_{i1},{w}_{i2}$, and
${w}_{j}\in [{w}_{i3},{b}_{r}.maxw]$, and can be estimated by Equation (6).
where
$\left{b}_{r}.\psi \right$ is the number of keywords contained in
${b}_{r}.\psi $.
${E}_{V}\left(qN\right)$ is the verification cost if
$q$ is inserted into
${b}_{r}$ in
$N$, and can be estimated with the expectation of verification cost of the queries in
${b}_{r}$, i.e., Equation (7).
where
${(Nu{m}_{q})}_{\le {w}_{j}}$ is the number of queries subjected to
${q}_{i}.\psi [3]\le {w}_{j}$ in
${b}_{r}$. Similarly, if the query
$q$ is associated with a set of nonoverlapping nodes, denoted as
$NS$, the verification cost is denoted as
${C}_{V}^{q}\left(q,NS\right)$ and can be estimated by Equation (8).
For
$\forall q\in \mathcal{Q}$, we find the optimal associated nodes starting from its minimum bounding node, and check whether the query is associated with the current node or associated with its child nodes. The difference of two verification costs is estimated by Equation (9).
where
$INS$ keeps the intermediate result,
$n.child$ contains the child nodes of
$n$ that intersect with the search range of the query. It is worth noting that if
$\Delta {C}_{V}^{q}\le 0$, we terminate the iteration.
Update cost. When inserting or deleting queries in nodes, it will incur an index update cost. We delay deleting queries until these queries are accessed again, so the deletion cost is ignored. If a query
$q$ is associated with its minimum bounding node
$N$, and is inserted into a block
${b}_{r}$ of posting list
$P{L}_{q.\psi [1]q.\psi [2]}$ in
$N$, the insertion cost consists of two parts, the time to find the corresponding block and the time to find the insertion position. The update cost, denoted as
${C}_{U}^{q}\left(q,N\right)$, can be estimated by Equation (10).
If
$q$ is associated with a set of nonoverlapping nodes, denoted as
$NS$, the update cost is denoted as
${C}_{U}^{q}\left(q,NS\right)$ and can be estimated by Equation (11).
Similarly to
$\Delta {C}_{V}^{q}$, the difference of two update costs between the query being associated with the node and associated with the child nodes is estimated by Equation (12).
Given $\forall q\in \mathcal{Q}$ and search range $q.S{R}_{k}$, we start from the minimum bounding node, and computes $\Delta {C}_{V}^{q}$ and $\Delta {C}_{U}^{q}$ between the query being associated with the node and associated with the child nodes. If $\Delta {C}_{V}^{q}\ge {\theta}_{U}\cdot \Delta {C}_{U}^{q}$, the child nodes are the optimal. Otherwise the node is optimal. The computation cost consists of two parts, finding the minimum bounding node of $q$, and finding an optimal association in the descendant nodes of minimum bounding node. The computation cost of the first part is $O({\theta}_{h})$, and the second part is $O(({4}^{{\theta}_{h}}1)/3)$, i.e., in the worst case, the node will be partitioned until the leaf node, where ${\theta}_{h}$ is the height of the quadtree.
3.5. Processing Objects in Batches
For these objects being verified with the same posting list, an object processing algorithm, which is a group matching technique that follows the filtering and verification strategy, is proposed to process objects in batches.
A data structure $\u2329bid,w,oset\u232a$ is defined to group the objects being verified with the same posting list, where $bid(bid>0)$ is the block id, $w(w\in [{b}_{bid}.minw,{b}_{bid}.maxw])$ is a term, $oset$ is a set of objects being verified with block ${b}_{bid}$ and containing $w$. For the convenience of the description, $wse{t}_{bid}$ is a set of terms satisfying $w\in [{b}_{bid}.minw,{b}_{bid}.maxw]$, $ose{t}_{bid,w}$ is a set of objects which are verified with the queries in block ${b}_{bid}$ and contain $w$.
Algorithm 2 describes how to process a set of objects represented by $\u2329bid,w,oset\u232a$, which are verified with the queries in the posting list $P{L}_{{w}_{i1}{w}_{i2}}$. If ${b}_{bid}$ is ${b}_{0}$, the queries in ${b}_{0}$ is verified with the objects in $oset$ (lines 1–5). Otherwise, for any term ${w}_{j}$ in $wse{t}_{bid}$, according to Theorem 2, if ${b}_{bid}.minw>{w}_{j}$, we check the next term (line 7); if ${b}_{bid}.maxw<{w}_{j}$, we check the next block (line 8); otherwise, for each query $q$ in ${b}_{bid}$, if $q.\psi [3]>{w}_{j}$, we check the next term (line 10); if $q.\psi [\leftq.\psi \right]<{w}_{j}$, we check the next query (line 11); otherwise, we verify whether the object is the results of $q$. If yes, $\u2329q,o\u232a$ is added to $QOS$ (lines 12–13). Moreover, the result list and search range of $q$ are updated.
Computation complexity. Algorithm 2 describes how to process objects in batches in an ordered posting list. In the worst case, objects are processed individually. As Lemma 1 shows, in posting lists constructed by two keywords, for an object, the time complexity of finding the qualified queries at a node is $\mathrm{O}({\lefto.\psi \right}^{2}\mathrm{log}{\left\mathcal{V}\right}^{2}+{\lefto.\psi \right}^{3}(\mathrm{log}\left\mathcal{B}\right+\leftb\right/\left\mathcal{B}\right\leftb.\psi \right\left)\right)$.
Algorithm 2: $\text{}\mathrm{ObjectProcessing}(P{L}_{{w}_{i1}{w}_{i2}},\{\u2329bid,w,oset\u232a\})$.


3.6. CostBased kSkyband Technique
To reduce the reevaluation cost, a costbased kskyband technique is proposed to judiciously determine an optimal search range for CkQST such that the overall cost defined in the cost model can be minimized. Specifically, for $\forall q\in \mathcal{Q}$, three parameters are defined: an extended search range is denoted as $q.SR$, where $q.SR\supseteq q.S{R}_{k}$; a kskyband, i.e., an extended result list, denoted as $q(\mathcal{O})$, where $q(\mathcal{O})\supseteq q{(\mathcal{O})}_{k}$; the number of objects containing all query keywords within $q.SR$ in the initial timestamp is denoted as $q.{\theta}_{k}$, where $q.{\theta}_{k}\ge q.k$.
Definition 4 (Loose Matching). Given $\forall o\in \mathcal{O}$ and $\forall q\in \mathcal{Q}$,$o$ loosely matches $q$ only if $q.\psi \subseteq o.\psi $ and $o.loc\in q.SR$. All the objects that loosely match $q$ aredenoted as $q{(\mathcal{O})}_{\mathrm{sup}}=\{o\in \mathcal{O}q.\psi \subseteq o.\psi \wedge o.loc\in q.SR\}$.
Definition 5 (Dominance). Given $\forall q\in \mathcal{Q}$and two objects ${o}_{1}$,${o}_{2}$, which loosely match $q$,${o}_{1}$ dominates ${o}_{2}$ only if $dist(q,{o}_{1})<dist(q,{o}_{2})$ and${o}_{1}.{t}_{e}\ge {o}_{2}.{t}_{e}$ or $dist(q,{o}_{1})\le dist(q,{o}_{2})$ and ${o}_{1}.{t}_{e}>{o}_{2}.{t}_{e}$.
Definition 6 (kskyband/Extended Result list). Given $\forall q\in \mathcal{Q}$, for any incoming object ${o}^{\prime}$, we insert it into the kskyband $q(\mathcal{O})$only if: (1) ${o}^{\prime}$loosely matches $q$; (2) ${o}^{\prime}$ is dominated by less than $k$ other objects. For$\forall q\in \mathcal{Q}$, ifan object ${o}^{\prime}$ loosely matches $q$, and there are $k$ objects in $q(\mathcal{O})$ dominating${o}^{\prime}$,${o}^{\prime}$would not be a result at any timestamp. Therefore, we would not insert these objects into the result list.
Theorem 5. Given$\forall q\in \mathcal{Q}$ and an extended search range$q.SR$, we always have the following conclusions. (1)$q{(\mathcal{O})}_{k}\subseteq q(\mathcal{O})$; (2)$q(\mathcal{O})\subseteq q{(\mathcal{O})}_{\mathrm{sup}}$; (3)$\leftq(\mathcal{O})\right<k$iff$\leftq{(\mathcal{O})}_{\mathrm{sup}}\right<k$.
Proof. (1) At any timestamp, for $\forall o\in q{(\mathcal{O})}_{k}$, we have $q.\psi \subseteq o.\psi $ and $o.loc\in q.S{R}_{k}\subseteq q.SR$, i.e., $o$ loosely matches $q$, and less than $k$ objects dominate $o$. So $o\in q(\mathcal{O})$. (2) According to Definition 4, for $\forall o\in q(\mathcal{O})$, $o$ loosely matches $q$, so $o\in q{(\mathcal{O})}_{\mathrm{sup}}$, $q(\mathcal{O})\subseteq q{(\mathcal{O})}_{\mathrm{sup}}$. (3) Since $q(\mathcal{O})\subseteq q{(\mathcal{O})}_{\mathrm{sup}}$, if $\leftq{(\mathcal{O})}_{\mathrm{sup}}\right<k$, we have $\leftq(\mathcal{O})\right\le \leftq{(\mathcal{O})}_{\mathrm{sup}}\right<k$. On the other hand, if $\leftq(\mathcal{O})\right<k$, $\leftq{(\mathcal{O})}_{\mathrm{sup}}\right\ge k$, i.e., $\exists o\in q{(\mathcal{O})}_{\mathrm{sup}}$ and $o\notin q(\mathcal{O})$, which means $o$ loosely matches $q$, but $o$ is dominated by more than $k$ other objects, which is contradicted by $\leftq(\mathcal{O})\right<k$. The theorem is proved. □
According to Theorem 5, the extended result list is the super set of the exact result list, from which we can extract the kNN objects, and the number of objects in extended result list is less than $k$ only if the number of objects in $q{(\mathcal{O})}_{\mathrm{sup}}$ is less than $k$.
Given $\forall q\in \mathcal{Q}$, an extended search range $q.SR$, and the corresponding extended result list $q(\mathcal{O})$, three costs are defined in the costbased kskyband technique: the verification cost of $q$ within $q.SR$, the update cost of $q(\mathcal{O})$, and the reevaluation cost.
Verification cost. The verification cost of
$q$ within
$q.SR$, within the unit time interval, denoted as
${C}_{V}^{R}\left(qq.SR\right)$, is estimated by Equation (13), i.e., the verification cost if
$q$ is inserted into all the leaf nodes that intersect with
$q.SR$.
Update cost. Similarly to
$q{(\mathcal{O})}_{k}$, the extended result list
$q(\mathcal{O})$ is organized by a linked list. For
$\forall o\in q(\mathcal{O})$, a dominance counter is defined to count the number of objects dominating
$o$. If an incoming object
${o}^{\prime}$ is inserted into
$q(\mathcal{O})$, the dominance counters of all the objects in
$q(\mathcal{O})$ with
$dist(q,{o}^{\prime})<dist(q,o)$ and
${o}^{\prime}.{t}_{e}\ge o.{t}_{e}$, or
$dist(q,{o}^{\prime})\le dist(q,o)$ and
${o}^{\prime}.{t}_{e}>o.{t}_{e}$ will increase by 1, and the objects with dominance counter equal to
$k$ will be evicted, which can be processed in
$O(\leftq(\mathcal{O})\right)$ time. When an object in
$\leftq(\mathcal{O})\right$ expires, it is deleted from
$q(\mathcal{O})$ until it is accessed again. The cost is negligible. The update cost of
$q(\mathcal{O})$ within unit time interval, denoted as
${C}_{U}^{L}(qq(\mathcal{O}))$, is estimated by Equation (14). Where
$fre{q}_{U}^{o}$ is the number of object updates within unit time interval,
${p}_{M}^{q}(q.SR)$ is the probability that the objects loosely match
$q$ within the search range
$q.SR$, and is estimated by
${p}_{M}^{q}(q.SR)=q.{\theta}_{k}/Nu{m}_{o}^{N}({n}_{0})$,
$1/2$ is the probability that a qualified object arrival,
$\leftq(\mathcal{O})\right$ can be estimated by
$\leftq(\mathcal{O})\right=\mathrm{max}\{k\cdot \mathrm{ln}({\theta}_{k}/k),{\theta}_{k}\}$.
Reevaluation cost. The reevaluation cost within the unit time interval is denoted as $1/{\theta}_{t}{C}_{Ie}\left(q\right)$. Where ${\theta}_{t}$ is the reevaluation period, i.e., the shortest time required between two consecutive independent evaluations, and $1/{\theta}_{t}$ is the frequency of reevaluation. ${C}_{Ie}\left(q\right)$ is the reevaluation cost, and is approximated to the verification cost in $q.S{R}_{k}$, i.e., ${C}_{Ie}\left(q\right)={C}_{V}^{R}\left(qq.S{R}_{k}\right)={\displaystyle {\sum}_{n.R\cap q.S{R}_{k}\ne \varnothing}{C}_{V}^{q}\left(q,n\right)}$.
The overall cost in the costbased kskyband technique, denoted as
${C}_{Re}\left(q\right)$, is shown in Equation (15). When
${C}_{Re}\left(q\right)$ is minimal, the search range is optimal.
In the following, we discuss how to get
${\theta}_{t}$.
${\theta}_{t}$ is the reevaluation period, i.e., the shortest time that
$\leftq(\mathcal{O})\right$ is reduced to
$k1$ since the last reevaluation. For
$\forall q\in \mathcal{Q}$, the update process of number of objects in
$q(\mathcal{O})$ can be modeled as a simple random walk, which is a stochastic sequence
${S}_{l}$, with
${S}_{0}$ being the original status, defined by
${S}_{l}={\displaystyle {\sum}_{i=1}^{l}{X}_{i}}$, where
${X}_{i}$ is the object update, which is an independent and identically distributed random variable. In
$q(\mathcal{O})$, if an object is inserted,
${X}_{i}=1$; if an object expires or is dominated,
${X}_{i}=1$; otherwise
${X}_{i}=0$. It’s difficult to estimate
${X}_{i}$ due to the eviction of objects by the dominance relationship in
$q(\mathcal{O})$. For example, an object is inserted, but the number of objects decreases due to the eviction of objects with dominance counters reaching
$k$. According to Theorem 5, the number of objects in
$q(\mathcal{O})$ is less than
$k$ only if the number of objects in
$q{(\mathcal{O})}_{\mathrm{sup}}$ is less than
$k$, and the objects in
$q{(\mathcal{O})}_{\mathrm{sup}}$ don’t dominate each other. Therefore we estimate the shortest time that
$\leftq{(\mathcal{O})}_{\mathrm{sup}}\right$ is reduced to
$k1$, denoted as
${\theta}_{t}{}^{\prime}$, where
${\theta}_{t}{}^{\prime}={\theta}_{t}$. The object update in
$q{(\mathcal{O})}_{\mathrm{sup}}$ at any timestamp can be estimated as Equation (16).
The number of object updates required to reduce the number of objects from
$q.{\theta}_{k}$ to
$k1$ in
$q(\mathcal{O})$, denoted as
$\mathbb{Z}(q)$, is estimated by Equation (17).
${\theta}_{t}$ is estimated by
${\theta}_{t}=\mathbb{Z}(q)/fre{q}_{U}^{o}$.
For $\forall q\in \mathcal{Q}$, the variables in Equation (15) are $q.SR$ and $q.{\theta}_{k}$. To minimize Equation (15), we employ the incremental estimation algorithm to compute the optimal $q.{\theta}_{k}$ and the corresponding $q.SR$.
To accommodate our extended search range with the objects processing algorithm and index construction and maintenance algorithm, we replace $q{(\mathcal{O})}_{k}$ with $q(\mathcal{O})$ and replace $q.S{R}_{k}$ with $q.SR$.