We provide definitions to describe the basic concept of high utility itemset mining,
and then, describe the proposed utility list structure and bitwise operation-based
BHUI algorithm in detail.
3.1 Problem Definition
Database D = {T1,T2,..,Tn} is a set of transactions, and I = {i1,i2,..,il} is a set
of all items used in D. Each transaction consists of a set of items, Ti={i1,i2,..,im}.
Each transaction is identified with an individual identifier, and each item ip included
in the transaction Tj is characterized with the purchase quantity and profit value.
The purchase quantity, which is denoted as q(ip,Tj), represents the total quantity
of ip in Tj. The profit value, which is described as pr(ip), means the profit of ip.
An itemset X={i1,i2,..,ik} is a set of joined items. If the length of X is k, then
it is called k-item.
Table 1. A transaction database
TID
|
a
|
b
|
c
|
d
|
e
|
f
|
g
|
tu
|
T1
|
1
|
|
|
2
|
|
|
1
|
9
|
T2
|
2
|
2
|
1
|
2
|
|
|
|
22
|
T3
|
|
3
|
1
|
2
|
2
|
|
|
28
|
T4
|
2
|
|
|
1
|
|
1
|
|
10
|
Table 2. Profit value
Item
|
a
|
b
|
c
|
d
|
e
|
f
|
g
|
Price
|
3
|
6
|
2
|
1
|
3
|
3
|
4
|
Table 1 shows an example of database consisting of 4 transactions and 7 items. Table 2 shows the profit value of each item. We use this example to explain our proposed
high utility itemset (HUI) mining algorithm.
Definition 1 (Item utility): Utility of an item $i_{p}$ in a transaction $T_{j}$ is
denoted by $iu(i_{p},\:T_{j})$ and defined as the following equation (1):
Definition 2 (Itemset utility): Utility of an itemset $X$ in a transaction $T_{j}$
is denoted by $iu(X,\: T_{j})$ and defined as the following equation (2):
For instance, if itemset X={a,b}, iu(ab, T2) = iu(a,T2) + iu(b,T2) = 6 + 12 = 18.
Definition 3 (Sum of item utility): Utility of an itemset X in the database D is denoted
by iu(X) and defined as the following equation (3):
Definition 4 (Remaining utility): Items following the itemset that belongs to $X$
in a transaction $T_{j}$ are the remaining items. The utility of the remaining items
is denoted by $ru(X,\: T_{j})$ and defined as the following equation (4):
where $R$ is a set of items following $X$. For instance, $ru(bcd,\: T_{3})$$= iu(e,\:
T_{3})= 6$.
Definition 5 (Transaction utility): Transaction utility of $T_{j}$ is denoted by $tu(T_{j})$
and defined as the following equation (5):
For instance, T4 = {a,d,f}, tu(T4) = iu(a,T4) + iu(d,T4) + iu(f,T4) = 6 + 1 + 3 =
10.
Definition 6 (Transaction weighted utility): Transaction weighted utility of $X$ in
$D$ is the total transaction utility of all transactions containing $X$. It is denoted
by $TWU(X)$ and defined as the following equation (6):
3.2 BHUI Algorithm
Now, we describe our BHUI algorithm. The core function of BHUI is to find all set
of high utility itemsets, which is denoted as HUI. HUI is initialized as ∅. Without
loss of generality, the sum of all transaction utilities in $D$, which is $TU(D)=\sum
tu(T_{j})$ for all $T_{j}\in D$, is the utility of $D$. Min_util is a minimum threshold
to become a high utility itemset and it is determined as min_util = $TU(D)\times\delta$
for a predefined threshold value $\delta$. $\delta$ can be either systemically predefined
or given from an expert user. Suppose that $\delta$ is 0.3. Thus, min_util = $69 *
0.3 = 20.7$ in our example.
First of all, the proposed algorithm initially calculates the $TWU(i_{p})$ for each
$i_{p}∈ I$ in order to discard items that do not satisfy min_util. This is to find
high utility itemsets with a length of 1, called 1-item. In our example, $TWU(a)=
tu(T_{1})+$ $tu(T_{2})+ tu(T_{4})= 41$, $TWU(b)= 50$, $TWU(c)= 50$, $TWU(d)= 69$,
$TWU(e)= 28$, $TWU(f)= 10$, and $TWU(g)= 9$. As a result, {e}, {a}, {c}, {b} and {d}
are determined as high utility itemsets of length 1, and added to HUI. For the chosen
itemsets, $tu(T_{i})$ of each transaction $T_{i}$ is reconstructed as shown in Table 3. $tu(T_{1})$ are $tu(T_{4})$ are updated because $f$ and $g$ are discarded from items.
Table 3. Reconstructed database
TID
|
e
|
a
|
c
|
b
|
d
|
tu
|
T1
|
|
3
|
|
|
2
|
5
|
T2
|
|
6
|
2
|
12
|
2
|
22
|
T3
|
6
|
|
2
|
18
|
2
|
28
|
T4
|
|
6
|
|
|
1
|
7
|
Once high utility itemsets are determined, the utility list structure (ULS) of each
itemset is generated. As shown in Fig. 1, the ULS of each itemset consists of four attributes: tid, iu, ru, and cu. For an
itemset X, tid indicates a transaction identifier containing X, iu is determined by
iu(X, Ttid), ru is determined by ru(X, Ttid), and cu denotes a common utility of X
that is defined as follows:
Definition 7 (Common utility): Common utility of an itemset $X^{k}$ of length $k$
is denoted by $cu(X^{k})$ and defined as $iu(X^{k-1})$, where $X^{k-1}$ consists of
the first $k-1$ items of $X^{k}$. $cu(X)$ for $X$ with a length of 1 is initialized
as 0.
For instance, $cu(ab)$ of $X=\{a,\:b\}$ is $iu(a)$.
The ULS is created in an ascending order of $TWU(X)$, such as $e < a < c < b < d$.
Once ULS constructing for initial high utility itemset is completed, the proposed
high utility itemset mining (HUI_Mining) algorithm is performed repeatedly. The mining
algorithm finds the next level of high utility itemsets. That is, 2-item sets that
are itemsets of length 2 are generated by joining (or combining) two 1-item itemsets.
A joined itemsest $X$ of $X_{i}$ and $X_{j}$ is defined as $X = X_{i}∪ X_{j}$. The
join process is also peformed in an ascending order of TWU. Since $TWU(e)< TWU(a)<
TWU(c)< TWU(b)< TWU(d)$, all possible combinations for 2-item are $\{e,\:a\}$, $\{e,\:c\}$,
$\{e,\:b\}$, $\{e,\:d\}$, $\{a,\:c\}$, $\{a,\:b\}$, $\{a,\:d\}$, $\{c,\:b\}$, $\{c,\:d\}$,
and $\{b,\:d\}$.
At this point, constructing ULS of all possible itemsets of 2-item is very time consuming.
If we can extract itemsets that have high join possibility, which are called joinable
itemsets, the HUI mining time will be reduced. The join possibility indicates the
likelihood that the joined itemset will be a high utility itemset. In order for a
joined itemset to become a high utility itemset, a common transaction, which contains
all of the joined items, must exist. Thus, the suggested strategy is to use bitwise
AND operation (∧) to check if there is a common transaction involving the joined itemset.
To do this, for an itemset $X$, a bit representation for all transactions is required.
Let $b(X)$ be the bit representation of $X$. As shown in Fig. 2, if a transaction containing $X$ is set to 1, otherwise, it is set to 0. Consequently,
a joined itemset $X$ of $X_{i}$ and $X_{j}$ is a joinable itemset if $b(X_{i})∧ b(X_{j})≥
1$. In addition, if $TWU(X)≥$min_util, then the ULS of $X$ is finally constructed.
Fig. 2. Example of Bitwise AND operation and n-item utility list structure
We explain the proposed HUI mining process with our examples. Fig. 2(a) shows bit representations of $\{e\}$ and $\{c\}$, and $b(e)∧ b(c)= 0010 ≥ 1$. Hence,
a joined itemset $\{e,\: c\}$ is a joinable itemset. After the bit operation, common
transactions that are corresponding to 1 in the bitwise AND result are determined.
Since only $T_{3}$ is 1 in the bitwise AND result, $TWU(ec)$ is calculated only for
$T_{3}$. Thus, $TWU(ec)= tu(T_{3})=$ $28 ≥$min_util. So, ULS of $\{e,\: c\}$ is constructed.
$iu(ec,\: T_{3})= 8$, $ru(ec,\: T_{3})= 20$, and $cu(ec)= iu(e)= 6$. $iu(e)$ can be
obtained from the ULS of $\{e\}$. Here, since $iu(ec)= 8 ≤$min_util, $\{e,\: c\}$
is not a high utility itemset. However, since $iu(ec,\: T_{3})+ ru(ec,\:$ $T_{3})=
28 ≥$min_util, it is still a joinable itemset. In Fig. 2(b), since $b(c)∧ b(b)= 0110 ≥ 1$, $\{c,\: b\}$ is a joinable itemset. Since common transactions
are $T_{2}$ and $T_{3}$, $TWU(cb)= tu(T_{2})+$ $tu(T_{3})= 50 ≥$min_util. Therefore,
ULS of $\{c,\: b\}$ is created. Since $iu(cb)= iu(cb,\: T_{2})+ iu(cb,\: T_{3})= 14
+ 20 = 34 ≥$min_util, it is added to SHU and joinable itemsets. In Fig. 2(c), $b(b)∧$ $b(d)$$= 0110 ≥ 1$ and $TWU(bd)= tu(T_{2})+ tu(T_{3})= 50 ≥$min_util. ULS
of $\{b,\: d\}$ is constructed. Since $iu(bd)= 34 ≥$min_util4, $\{b,\: d\}$ is added
to SHU and joinable itemsets. Fig. 2(d) shows the generation of ULS of 3-item $\{e,\:c,\:b\}$ from $\{e,\: c\}$ and $\{c,\:
b\}$. $b(ec)$$∧ b(cb)= 0010 ≥ 1$ and $TWU(ecb)= tu(T_{3})= 28 ≥$min_util. Since $iu(ecb)=
iu(ec)+ iu(cb)- cu(cb)= 8 + 20 - 2 = 26 ≥$min_util, it is added to SHU and joinable
itemsets. $cu(ecb)$ is determined as $cu(ecb)= iu(ec)=8$. In 3-item $\{c,\:b,\:d\}$
of Fig. 2(e), $b(cb)∧$ $b(bd)= 0110$, common transactions $T_{2}$ and $T_{3}$ exist. Since $iu(cbd)=
iu(cb)+ iu(bd)- cu(bd)= 34 + 34 - 30 = 38 ≥$min_util$, it is added to HUI and joinable
itemsets. Fig. 2(f) shows ULS of 4-item $\{e,\:c,\:b,\:d\}$. Since $iu(ecbd)= iu(ecb)+ iu(cbd)- cu(cbd)=$$26$
$+ 22 - 20 = 28 ≥$min_util, it is a high utility itemset and a joinable itemset, as
well. $cu(ecbd)= iu(ecb)= 26$. The Mining process stops since no common transactions
exist by bitwise AND operation with 4-item $\{e,\:c,\:b,\:d\}$ and $\{a,\:c,\:b,\:d\}$,
which are high utility itemsets.
The detailed descriptions of the BHUI and HUI_mining algorithms are given below. BHUI
gets two parameters D and min_util as inputs, and initiates our mining process. BHUI
initially scans D to create a bitmap to indicate whether each item exists in the transaction,
and then, computes the TWU and item utility of each item in I. Then, the items are
sorted by TWU in an ascending order. The TWU of each item is used to discard unpromising
items from further processing. If TWU of each sorted itemset is greater than min_util,
then ULS of the itemset is constructed. The ULS of 1-item are used to explore the
search space and to mine the next level of high utility itemsets. After ULS of 1-item
is constructed, the HUI_mining is performed repeatedly so as to mine a set of high
utility itemsets (HUI). It performs bitwise AND operation for two itemsets in ULS
to create a joined itemset and checks if the joined itemset satisfies min_util. It
creates ULS of meaningful joined itemsets and outputs new high utility itemsets. The
HUI_mining is repeated until no more joined itemset is created. Finally, HUI includes
all high utility itemsets.