Ome improvements about a unified hierarchy for functional, conditional functional dependencies and association rules - Vuc Quốc Tuấn

Tài liệu Ome improvements about a unified hierarchy for functional, conditional functional dependencies and association rules - Vuc Quốc Tuấn: Research Journal of Military Science and Technology, Special Issue, No.60A, 05 - 2019 87 SOME IMPROVEMENTS ABOUT A UNIFIED HIERARCHY FOR FUNCTIONAL, CONDITIONAL FUNCTIONAL DEPENDENCIES AND ASSOCIATION RULES Vu Quoc Tuan* Abstract: In [1], the authors have shown a hierarchy between Functional Dependencies (FDs), Conditional Functional Dependencies (CFDs) and Association Rules (ARs): FDs are the union of CFDs while CFDs are the union of ARs. The hierarchy between FDs, CFDs, and ARs has many benefits: algorithms for discovering ARs can be adapted to discover many other types of data dependencies and further generate a reduced set of dependencies. In this paper, we give some remarks and improvements related to the uniﬁed hierarchy for FDs, CFDs and ARs. Keywords: Database; Functional Dependency; Conditional Functional Dependency; Association Rule. 1. INTRODUCTION Data dependencies play important roles in database design, data quality management and knowledge ...

8 trang | Chia sẻ: quangot475 | Lượt xem: 505 | Lượt tải: 0

Bạn đang xem nội dung tài liệu Ome improvements about a unified hierarchy for functional, conditional functional dependencies and association rules - Vuc Quốc Tuấn, để tải tài liệu về máy bạn click vào nút DOWNLOAD ở trên

Research Journal of Military Science and Technology, Special Issue, No.60A, 05 - 2019 87 SOME IMPROVEMENTS ABOUT A UNIFIED HIERARCHY FOR FUNCTIONAL, CONDITIONAL FUNCTIONAL DEPENDENCIES AND ASSOCIATION RULES Vu Quoc Tuan* Abstract: In [1], the authors have shown a hierarchy between Functional Dependencies (FDs), Conditional Functional Dependencies (CFDs) and Association Rules (ARs): FDs are the union of CFDs while CFDs are the union of ARs. The hierarchy between FDs, CFDs, and ARs has many benefits: algorithms for discovering ARs can be adapted to discover many other types of data dependencies and further generate a reduced set of dependencies. In this paper, we give some remarks and improvements related to the uniﬁed hierarchy for FDs, CFDs and ARs. Keywords: Database; Functional Dependency; Conditional Functional Dependency; Association Rule. 1. INTRODUCTION Data dependencies play important roles in database design, data quality management and knowledge representation. Nowadays, along with the development of the society and digital devices, especially social networks and smartphone applications, the amount of data in the applications increases very quickly. These arise data problems such as storage, management and knowledge discovery from large data sets. Discovering data dependencies in a database is one of important problems of knowledge discovery. Three essential types of data dependencies are FDs (Functional Dependencies), CFDs (Conditional Functional Dependencies), and ARs (Association Rules). The authors in [1] show a hierarchy between FDs, CFDs and ARs: FDs are the union of CFDs while CFDs are the union of ARs. This hierarchy has some benefits: existing algorithms which discover ARs could be adapted to discover any kind of dependencies and, moreover, generate a reduced set of dependencies. In this paper, we give some remarks and improvements related to the hierarchy between FDs, CFDs and ARs. This paper is organized as: the second section recalls some concepts and definitions. The third section gives some remarks and improvements. The conclusions are presented in the fourth section. 2. BACKGROUND AND DEFINITIONS 2.1. Relation A relation on the set of attributes Ω = {A1, A2,,An} is a subset of the Cartesian product Dom(A1)  Dom(A2)   Dom(An), where Dom(Ai) is the domain of Ai, i = 1, 2,, n. The elements of the relation are called tuples. Let r be a relation on the set of attributes Ω = {A1, A2,,An}. Then r  {(a1, a2,,an) | ai  Dom(Ai), i = 1, 2,, n} 2.2. Relation schema A relation schema R is a ordered pair R = , where  is a finite set of attributes of the relation, F is a set of constraints involving the attributes. In this paper, we limit F to consist only of FDs. Let us denote: Information technology & Applied mathematics Vu Quoc Tuan, “Some improvements about dependencies and association rules.” 88 R() is instead of R = in some cases, r(R) refers to a relation r of R, t[X] is the projection of the tuple t on X (t  r). 2.3. Functional dependency (FD) A functional dependency over a relation schema R() is an expression X  Y, where X and Y are subsets of . The dependency holds or is valid in a given relation r(R) if  t1, t2  r, t1[X] = t2[X]  t1[Y] = t2[Y] 2.4. Armstrong's axioms For each X, Y, Z  , A1. (Reflexivity): If Y  X then X  Y. A2. (Augmentation): If X  Y then XZ  YZ. A3. (Transitivity): If X  Y and Y  Z then X  Z. Let F be a given set of FDs, the closure F+ of F is the set of FDs which can be derived from F through Armstrong's reference rules. 2.5. The closure of a set of attributes Let F be a given set of FDs over the set of attributes  and X  . The closure of X under F, denoted by FX  , is a set defined as: FX  = {A    (X  A)  F+} From the definition of FX  and Armstrong's reference rules, it is easily seen that X  Y holds if and only if Y  FX  . 2.6. Conditional Functional Dependency (CFD) The concept of CFD [2, 3, 4, 7, 8] is an extension of the FD concept for data cleaning. Let R() be a relation schema and X, Y  . A CFD  on R is a pair (X  Y, Tp), where (1) X  Y is a standard FD referred to as the FD embedded in  and (2) Tp is a tableau with all attributes in X  Y, referred to as the pattern tableau of , where for each A  X  Y and each tuple t  Tp, t[A] is either a constant "a" in the domain Dom(A), or an unnamed variable "". Intuitively, the pattern tableau Tp of  refines the standard FD embedded in  by enforcing the binding of semantically related data values. 2.7. Association Rule (AR) Let R() be a relation schema and X1, ..., Xk, A  . An AR ([1]) is a dependency (X1 = b1)  ...  (Xk = bk)  (A = a), meaning that for any tuple t  r(R), if t[X1] = b1 and ... and t[Xk] = bk then t[A] = a. 2.8. Definitions Let R be a relation schema defined over a set of attributes  = Attr(R). Authors in [1] have given definitions below. Research Journal of Military Science and Technology, Special Issue, No.60A, 05 - 2019 89 Definition 2.8.1 (CFDs definition). A CFD  defined on R is a pair (X  Y, Tp), where (1) X  Y is a standard FD, referred to as the FD embedded in ; and (2) Tp is a tableau with attributes in R, referred to as the pattern tableau of , where for each A  Attr(R) and each pattern tuple tp  Tp, tp[A] is either: - a constant "a" in Dom(A), - an unnamed variable  that draws values from Dom(A), - or an empty variable  which indicates that the attribute does not contribute to the pattern (i.e. A  X  Y). Definition 2.8.2 ( definition). For any constant a of an attribute, we have   a  . We define the pattern intersection operator  of two tuples as: t1  t2 = tp such that A  Attr(R), 1 1 2 2 1 2 [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] , p p p t A t A t A t A t A t A t A t A t A          if if otherwise Definition 2.8.3 (Pattern restriction definition). We define the pattern restriction to attributes X of a tuple t, denoted by t(X) as: t(X) = tp such that A  Attr(R), [ ] [ ] [ ] p p t A t A t A       if A X if A X Definition 2.8.4 ( definition). A subsumption operator  over pattern tuples t1 and t2 defined as: t1  t2 if and only if A  Attr(R), t1[A]  t2[A] In other words, t1  t2 if and only if t1  t2 = t1. Definition 2.8.5 (r   definition). An instance r of R satisfies the CFD , denoted by r  , if for each tuple tp in the pattern tableau Tp of , and for each pair of tuples t1, t2 in r, if t1[X] = t2[X]  tp[X] then t1[Y] = t2[Y]  tp[Y] In other words, a CFD is an FD satisfied by a fragment relation. A pattern tableau can thus be seen as a selection query on a relation returning a fragment of the relation. Example 2.1. Consider the relation Q in Table 1: Table 1. Relation Q. Q A B C D t1 001 124 12 34 t2 001 124 21 43 t3 002 157 34 69 t4 002 157 34 61 Information technology & Applied mathematics Vu Quoc Tuan, “Some improvements about dependencies and association rules.” 90 t5 002 158 89 62 t6 002 158 89 90 We have: CFD  = (AB  C, {002, , , }) is satisfied by the relation Q, t1  t2 = (001, 124, 12, 34), t1  t2 and t1(BC) = (, 124, 12, ). Definition 2.8.6. A pattern tuple tp defines a fragment relation of r: pt r = {t  r | tp  t} The authors of [1] have successfully exploited the properties of X-complete relations, introduced in [6], and established the link between FDs, CFDs and ARs on that foundation. In section 2.2 of [1], there are the following definitions: Definition 2.8.7 (X-complete property). The relation r is said to be X-complete if and only if  t1, t2  r we have t1[X] = t2[X]. Example 2.2. The relation in Table 2 is A-complete. Definition 2.8.8 (X-complete pattern). We call X-complete pattern of an X- complete relation r, denoted by (X, r), the pattern tuple on which tuples of r agree. More formally: (X, r) =  {t  r} Definition 2.8.9 (X-complete horizontal decomposition). We denote by RX(r) the set of all X-complete fragment relations of r. More formally: RX(r) = {r'  r | r' is X-complete} Definition 2.8.10 (Set of X-patterns). We denote by (X, r) the set of all X- complete patterns of an X-complete decomposition. More formally: (X, r) = {(X, r') | r'  RX(r)} Definition 2.8.11 (Closure operator). We call closure of X in r, denoted by (X, r), the set of all attributes defined in all X-complete patterns of the relation. More formally: (X, r) = {A  Attr(R) | tp  (X, r), tp[A]  } 3. SOME REMARKS AND IMPROVEMENTS Remark 3.1. The definition of a CFD in [1] is different from the standard one, given for instance in [2, 3] by the extension for all tp  Tp which are defined in Attr(R), instead only in (X  Y)  Attr(R). This is done by assigning tp[A] =  for all A  (X  Y). Such extension makes some easy for the matching of two tuples but with the price by using the pattern restriction to attributes X (resp. Y) of a tuple t. Remark 3.2. The matching of a tuple t  r with a tuple tp  Tp (tp is now defined on Attr(R)) is in fact the matching of t(X  Y) and tp(X  Y) to know whether there are assignments appropriate constant a to  (in tp) such that they will be identical. More formally, t(X) and tp(X) (respectively t(Y) and tp(Y)) are matching if Research Journal of Military Science and Technology, Special Issue, No.60A, 05 - 2019 91 A  X: t(X)[A] = tp(X)[A] = a  Dom(A) or t(X)[A] = a and tp(X)[A] =  Remark 3.3. In Definition 3.6, there is the following statement: A pattern tuple tp defines a fragment relation of r: pt r = {t  r | tp  t} (*) Due to the order relation   a   for any constant a of an attribute, it is clear that the formula (*) is incorrect. The reason is that in almost cases, (*) returns empty set. In fact, in case tp contains at least one component , then there exists not t  r such that tp  t. In the opposite case (tp does not contain the component ) and X  Y  Attr(R), we have tp[A] =  and t[A] = a for A  X  Y. Therefore, there does not exist t  r such that tp  t. So, pt r , defined by (*) returns the non-empty result only when X  Y = Attr(R) and tp coincides with some tuple t in r. Hence, the expression (*) must be changed to pt r = {t  r | t(X  Y)  tp(X  Y)} (**) Example 3.1. Consider the relation Q in Table 1 and the pattern tuple tp = (002, , , ). Then pt r is the fragment relation in Table 2. Table 2. A fragment relation pt r . pt r A B C D t3 002 157 34 69 t4 002 157 34 61 t5 002 158 89 62 t6 002 158 89 90 Remark 3.4. Let r is an X-complete relation and r'  r. Then (X, r') =  {t  r'} Returning to the Definition 3.2 of  on two tuples t1, t2  r. Here, using the order relation   a   for any constant a of an attribute to compute t1  t2, we can have unnecessary difficulties which are time consuming. Essentially, we have to compare the respective components of two tuples t1 and t2 to know whether they are equal or not. Therefore, instead of the operator , it is better to use the simple operator  , defined as follows: With t1, t2  r, Information technology & Applied mathematics Vu Quoc Tuan, “Some improvements about dependencies and association rules.” 92 t1  t2 = t such that A  Attr(R), 1 1 2 1 2 [ ] [ ] [ ] [ ] [ ] [ ] [ ] t A t A A t A t A A t A       if t if t Example 3.2. To illustrate, we consider the relation r and relations r1, r2, r3, r4 as: Table 3. The relation r. r A B C D E G t1 a1 b1 c1 d1 e1 g1 t2 a2 b1 c1 d1 e2 g2 t3 a2 b4 c4 d1 e6 g2 t4 a2 b2 c2 d1 e3 g2 t5 a3 b2 c2 d2 e4 g3 t6 a3 b3 c3 d2 e5 g3 t7 a4 b3 c3 d3 e6 g4 t8 a1 b1 c1 d1 e3 g1 The relation r is an instance of the relation schema R({A, B, C, D, E, G}). It is easily seen that r satisfies the set of FDs F = {B  C, C  B, A  DG}. The following relations r1, r2, r3, r4 are fragment relations of r and A-complete. Table 4. The relation r2. r1 A B C D E G t1 a1 b1 c1 d1 e1 g1 t8 a1 b1 c1 d1 e3 g1 (A, r1) =  {t  r1} = (a1, b1, c1, d1, , g1) Table 5. The relation r2. r2 A B C D E G t2 a2 b1 c1 d1 e2 g2 t3 a2 b4 c4 d1 e6 g2 t4 a2 b2 c2 d1 e3 g2 (A, r2) =  {t  r2} = (a2, , , d1, , g2) Table 6. The relation r3. r3 A B C D E G t5 a3 b2 c2 d2 e4 g3 t6 a3 b3 c3 d2 e5 g3 (A, r3) =  {t  r3} = (a3, , , d2, , g3) Table 7. The relation r4. r4 A B C D E G t7 a4 b3 c3 d3 e6 g4 (A, r4) =  {t  r4} = (a4, b3, c3, d3, e6, g4) The set of A-complete pattern tuples of r as: (A, r) = {(a1, b1, c1, d1, , g1), (a2, , , d1, , g2), Research Journal of Military Science and Technology, Special Issue, No.60A, 05 - 2019 93 (a3, , , d2, , g3), (a4, b3, c3, d3, e6, g4)} By the elements of (A, r), we have (A, r) = {A, D, G}. Remark 3.5. In practice, to compute the X-complete-pattern ( , )X r of an X- complete relation r, i.e. to compute  t r  , we can use the following method: ( )A Attr R  ,  ( , )[ ]X r A = , , the column A of r contains an unique value a otherwise a if   More formally ( )A Attr R  , ( , )[ ]X r A = , ( ) ( ) , otherwise Aa if r a    Remark 3.6. Basing on definitions and properties of (X, r) and FX  , we have the proposition as: Let r be a relation of R defined on Attr(R), X  Attr(R), and r satisfies a set of FDs F. Then (X, r) = {A  Attr(R) | tp  (X, r), tp[A]  } = FX  = {A  Attr(R) | (X  A)  F+}, where (X, r) = {(X, r') | r'  RX(r)} and (X  A)  F +. Proof. A  (X, r)  tp  (X, r), we have tp[A]  . Therefore,  t, t'  r such that t[X] = t'[X]  t[A] = t'[A]. So X  A  F+ or A  FX  . Conversely, suppose that A  FX   X  A  F+. Then,  r'  RX(r) and  t, t'  r', we have t[X] = t'[X]. Since X  A  F+ then t[A] = t'[A]  (X, r')[A]    A  (X, r). The above result is a tool for discovering FDs from (X, r), which means that X  Y if and only if Y  FX  = (X, r). Example 3.3. From (A, r) = {A, D, G} in Example 3.1, we have A  D and A  G. 4. CONCLUSIONS In this paper, we give some remarks and preliminary results concerning a work about a unified hierarchy for FDs, CFDs and ARs: adjusting the expression pt r which defines a fragment relation; proposing an improvement for the operator  to  which is more rigorous semantically and computed easily; proving that ( , ) FX r X  . Studying and solving problems related to FDs, CFDs and ARs is a new and very interesting direction. REFERENCES [1]. R. Medina, and L. Nourine, "A Uniﬁed Hierarchy for Functional Dependencies, Conditional Functional Dependencies and Association Rules", ICFCA 2009, LNAI 5548, pp. 98–113, (2009). Information technology & Applied mathematics Vu Quoc Tuan, “Some improvements about dependencies and association rules.” 94 [2]. P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis, "Conditional functional dependencies for data cleaning", IEEE 23rd International Conference on Data Engineering, pages 746–755, (2007). [3]. W. Fan, F. Geerts, J. Li, and M. Xiong, "Discovering conditional functional dependencies", IEEE Transactions on Knowledge and Data Engineering, Volume 23, Issue 5, pp. 683-698, (2011). [4]. P. Z. Yeh, and C. A. Puri, "Discovering conditional functional dependencies to detect data inconsistencies", ACM Transactions on Database Systems (TODS), Volume 33, Issue 2, Article No. 6, (2008). [5]. W. Fan, and F. Geerts, "Foundations of Data Quality Management", Morgan & Claypool Publisher, (2012). [6]. P. D. Bra, and J. Paredaens, "An algorithm for horizontal decompositions", Information Processing Letters, 17(2), pp. 91-95, (1983). [7]. R. Salem and A. Abdo, "Fixing rules for data cleaning based on conditional functional dependency". Future Computing and Informatics Journal, 1(1-2), 10–26.doi:10.1016/j.fcij.2017.03.002. [8]. J. Rammelaere and F. Geerts, "Revisiting Conditional Functional Dependency Discovery: Splitting the “C” from the “FD”", Machine Learning and Knowledge Discovery in Databases pp 552-568, (2018). [9]. T. Diallo, N. Novelli and J.M. Petit, "Discovering (frequent) constant conditional functional dependencies". IJDMMM 4(3), 205–223, (2012). TÓM TẮT MỘT SỐ CẢI TIẾN VỀ SỰ PHÂN CẤP GIỮA PHỤ THUỘC HÀM, PHỤ THUỘC HÀM ĐIỀU KIỆN VÀ LUẬT KẾT HỢP Trong [1], các tác giả đã chỉ ra một thứ tự phân cấp giữa các phụ thuộc hàm (FD), phụ thuộc hàm điều kiện (CFD) và luật kết hợp (AR): FD là hợp của các CFD trong khi CFD là hợp của các AR. Thứ tự phân cấp giữa các FD, CFD và AR mang lại nhiều lợi ích: các thuật toán hiện có để phát hiện AR có thể được thích nghi để phát hiện nhiều loại phụ thuộc dữ liệu khác và hơn nữa còn sinh một tập được rút gọn các phụ thuộc. Trong bài báo này, chúng tôi đưa ra một số nhận xét và cải tiến có liên quan tới thứ tự phân cấp giữa FD, CFD và AR. Từ khóa: Cơ sở dữ liệu; Phụ thuộc hàm; Phụ thuộc hàm điều kiện; Luật kết hợp. Received 17th January 2019 Revised 16th April 2019 Accepted 15th May 2019 Author affiliations: Graduate University of Science and Technology, Vietnam Academy of Science and Technology; *Corresponding author: vqtuanhd@gmail.com.

Các file đính kèm theo tài liệu này:

10_tuan_9489_2150387.pdf