Luận văn Hidden topic discovery toward classification and clustering in vietnamese web documents

Tài liệu Luận văn Hidden topic discovery toward classification and clustering in vietnamese web documents: VIET NAM NATIONAL UNIVERSITY, HANOI COLLEGE OF TECHNOLOGY NGUYEN CAM TU HIDDEN TOPIC DISCOVERY TOWARD CLASSIFICATION AND CLUSTERING IN VIETNAMESE WEB DOCUMENTS MASTER THESIS HANOI - 2008 VIET NAM NATIONAL UNIVERSITY, HANOI COLLEGE OF TECHNOLOGY NGUYEN CAM TU HIDDEN TOPIC DISCOVERY TOWARD CLASSIFICATION AND CLUSTERING IN VIETNAMESE WEB DOCUMENTS Major: Information Technology Specificity: Information Systems Code: 60 48 05 MASTER THESIS SUPERVISOR: Prof. Dr. Ha Quang Thuy HANOI - 2008 i Acknowledgements My deepest thank must first go to my research advisor, Prof. Dr. Ha Quang Thuy, who offers me an endless inspiration in scientific research, leading me to this research area. I particularly appreciate his unconditional support and advice in both academic environment and daily life during the last four years. Many thanks go to Dr. Phan Xuan Hieu who has given me many advices and comments. This work can not be possible without his supp...

67 trang | Chia sẻ: hunglv | Lượt xem: 2761 | Lượt tải: 0Free

Bạn đang xem trước 20 trang mẫu tài liệu Luận văn Hidden topic discovery toward classification and clustering in vietnamese web documents, để tải tài liệu gốc về máy bạn click vào nút DOWNLOAD ở trên

VIET NAM NATIONAL UNIVERSITY, HANOI COLLEGE OF TECHNOLOGY NGUYEN CAM TU HIDDEN TOPIC DISCOVERY TOWARD CLASSIFICATION AND CLUSTERING IN VIETNAMESE WEB DOCUMENTS MASTER THESIS HANOI - 2008 VIET NAM NATIONAL UNIVERSITY, HANOI COLLEGE OF TECHNOLOGY NGUYEN CAM TU HIDDEN TOPIC DISCOVERY TOWARD CLASSIFICATION AND CLUSTERING IN VIETNAMESE WEB DOCUMENTS Major: Information Technology Specificity: Information Systems Code: 60 48 05 MASTER THESIS SUPERVISOR: Prof. Dr. Ha Quang Thuy HANOI - 2008 i Acknowledgements My deepest thank must first go to my research advisor, Prof. Dr. Ha Quang Thuy, who offers me an endless inspiration in scientific research, leading me to this research area. I particularly appreciate his unconditional support and advice in both academic environment and daily life during the last four years. Many thanks go to Dr. Phan Xuan Hieu who has given me many advices and comments. This work can not be possible without his support. Also, I would like to thank him for being my friend, my older brother who has brought me a lot of lessons in both scientific research and daily life. My thanks also go to all members of seminar group “data mining”. Especially, I would like to thank Bsc. Nguyen Thu Trang for helping me a lot in collecting data and doing experiments. I highly acknowledge the invaluable support and advice in both technical and daily life of my teachers, my colleagues in Department of Information Systems, Faculty of Technology, Vietnam National University, Hanoi I also want to thank the supports from the Project QC.06.07 “Vietnamese Named Entity Resolution and Tracking crossover Web Documents”, Vietnam National University, Hanoi; the Project 203906 “`Information Extraction Models for finding Entities and Semantic Relations in Vietnamese Web Pages'' of the Ministry of Science and Technology, Vietnam; and the National Project 02/2006/HĐ - ĐTCT-KC.01/06-10 “Developing content filter systems to support management and implementation public security – ensure policy” Finally, from bottom of my heart, I would specially like to say thanks to all members in my family, all my friends. They are really an endless encouragement in my life. Nguyen Cam Tu ii Assurance I certify that the achievements in this thesis belong to my personal, and are not copied from any other’s results. Throughout the dissertation, all the mentions are either my proposal, or summarized from many sources. All the references have clear origins, and properly quoted. I am responsible for this statement. Hanoi, November 15, 2007 Nguyen Cam Tu iii Table of Content Introduction ..........................................................................................................................1 Chapter 1. The Problem of Modeling Text Corpora and Hidden Topic Analysis ...............3 1.1. Introduction ...............................................................................................................3 1.2. The Early Methods ....................................................................................................5 1.2.1. Latent Semantic Analysis ...................................................................................5 1.2.2. Probabilistic Latent Semantic Analysis..............................................................8 1.3. Latent Dirichlet Allocation......................................................................................11 1.3.1. Generative Model in LDA................................................................................12 1.3.2. Likelihood.........................................................................................................13 1.3.3. Parameter Estimation and Inference via Gibbs Sampling................................14 1.3.4. Applications......................................................................................................17 1.4. Summary..................................................................................................................17 Chapter 2. Frameworks of Learning with Hidden Topics..................................................19 2.1. Learning with External Resources: Related Works ................................................19 2.2. General Learning Frameworks ................................................................................20 2.2.1. Frameworks for Learning with Hidden Topics ................................................20 2.2.2. Large-Scale Web Collections as Universal Dataset .........................................22 2.3. Advantages of the Frameworks ...............................................................................23 2.4. Summary..................................................................................................................23 Chapter 3. Topics Analysis of Large-Scale Web Dataset ..................................................24 3.1. Some Characteristics of Vietnamese .......................................................................24 3.1.1. Sound................................................................................................................24 3.1.2. Syllable Structure .............................................................................................26 3.1.3. Vietnamese Word .............................................................................................26 3.2. Preprocessing and Transformation..........................................................................27 3.2.1. Sentence Segmentation.....................................................................................27 iv 3.2.2. Sentence Tokenization......................................................................................28 3.2.3. Word Segmentation ..........................................................................................28 3.2.4. Filters ................................................................................................................28 3.2.5. Remove Non Topic-Oriented Words ...............................................................28 3.3. Topic Analysis for VnExpress Dataset ...................................................................29 3.4. Topic Analysis for Vietnamese Wikipedia Dataset ................................................30 3.5. Discussion................................................................................................................31 3.6. Summary..................................................................................................................32 Chapter 4. Deployments of General Frameworks ..............................................................33 4.1. Classification with Hidden Topics ..........................................................................33 4.1.1. Classification Method.......................................................................................33 4.1.2. Experiments ......................................................................................................36 4.2. Clustering with Hidden Topics................................................................................40 4.2.1. Clustering Method ............................................................................................40 4.2.2. Experiments ......................................................................................................45 4.3. Summary..................................................................................................................49 Conclusion ..........................................................................................................................50 Achievements throughout the thesis...............................................................................50 Future Works ..................................................................................................................50 References ..........................................................................................................................52 Vietnamese References ..................................................................................................52 English References .........................................................................................................52 Appendix: Some Clustering Results...................................................................................56 v List of Figures Figure 1.1. Graphical model representation of the aspect model in the asymmetric (a) and symmetric (b) parameterization. ( [55]) ...............................................................................9 Figure 1.2. Sketch of the probability sub-simplex spanned by the aspect model ( [55])...10 Figure 1.3. Graphical model representation of LDA - The boxes is “plates” representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document [20] ............................................12 Figure 1.4. Generative model for Latent Dirichlet allocation; Here, Dir, Poiss and Mult stand for Dirichlet, Poisson, Multinomial distributions respectively.................................13 Figure 1.5. Quantities in the model of latent Dirichlet allocation......................................13 Figure 1.6. Gibbs sampling algorithm for Latent Dirichlet Allocation..............................16 Figure 2.1. Classification with Hidden Topics...................................................................20 Figure 2.2. Clustering with Hidden Topics ........................................................................21 Figure 3.1. Pipeline of Data Preprocessing and Transformation .......................................27 Figure 4.1. Classification with VnExpress topics .............................................................33 Figure 4.2 Combination of one snippet with its topics: an example ..................................35 Figure 4.3. Learning with different topic models of VnExpress dataset; and the baseline (without topics)...................................................................................................................37 Figure 4.4. Test-out-of train with increasing numbers of training examples. Here, the number of topics is set at 60topics .....................................................................................37 Figure 4.5 F1-Measure for classes and average (over all classes) in learning with 60 topics...................................................................................................................................39 Figure 4.6. Clustering with Hidden Topics ........................................................................40 Figure 4.7. Dendrogram in Agglomerative Hierarchical Clustering..................................42 Figure 4.8 Precision of top 5 (and 10, 20) in best clusters for each query.........................47 Figure 4.9 Coverage of the top 5 (and 10) good clusters for each query ...........................47 vi List of Tables Table 3.1. Vowels in Vietnamese.......................................................................................24 Table 3.2. Tones in Vietnamese .........................................................................................25 Table 3.3. Consonants of hanoi variety ..............................................................................26 Table 3.4. Structure of Vietnamese syllables ....................................................................26 Table 3.5. Functional words in Vietnamese .......................................................................29 Table 3.6. Statistics of topics assigned by humans in VnExpress Dataset.........................29 Table 3.7. Statistics of VnExpress dataset .........................................................................30 Table 3.8 Most likely words for sample topics. Here, we conduct topic analysis with 100 topics...................................................................................................................................30 Table 3.9. Statistic of Vietnamese Wikipedia Dataset ......................................................31 Table 3.10 Most likely words for sample topics. Here, we conduct topic analysis with 200 topics...................................................................................................................................31 Table 4.1 Google search results as training and testing dataset. The search phrases for training and test data are designed to be exclusive ............................................................34 Table 4.2. Experimental results of baseline (learning without topics)...............................38 Table 4.3. Experimental results of learning with 60 topics of VnExpress dataset.............38 Table 4.4. Some collocations with highest values of chi-square statistic ..........................44 Table 4.5. Queries submitted to Google.............................................................................45 Table 4.6. Parameters for clustering web search results ....................................................46 vii Notations & Abbreviations Word or phrase Abbreviation Information Retrieval IR Latent Semantic Analysis LSA Probability Latent Semantic Analysis PLSA Latent Dirichlet Allocation LDA Dynamic Topic Models DTM Correlated Topic Models CTM Singular Value Decomposition SVD 1 Introduction The World Wide Web has influenced many aspects of our lives, changing the way we communicate, conduct business, shop, entertain, and so on. However, a large portion of the Web data is not organized in systematic and well structured forms, a situation which causes great challenges to those seeking for information on the Web. Consequently, a lot of tasks enabling users to search, navigate and organize web pages in a more effective way have been posed in the last decade, such as searching, page rank, web clustering, text classification, etc. To this end, there have been a lot of successful stories like Google, Yahoo, Open Directory Project (Dmoz), Clusty, just to name but a few. Inspired by this trend, the aim of this thesis is to develop efficient systems which are able to overcome the difficulties of dealing with sparse data. The main motivation is that while being overwhelmed by a huge amount of online data, we sometimes lack data to search or learn effectively. Let take web search clustering as an example. In order to meet the real-time condition, that is the response time must be short enough, most of online clustering systems only work with small pieces of text returned from search engines. Unfortunately those pieces are not long and rich enough to build a good clustering system. A similar situation occurs in the case of searching images only based on captions. Because image captions are only very short and sparse chunks of text, most of the current image retrieval systems still fail to achieve high accuracy. As a result, much effort has been made recently to take advantage of external resources like learning with knowledge-base support, semi-supervised learning, etc. in order to improve the accuracy. These approaches, however, have some difficulties: (1) constructing a knowledge base is very time-consuming & labor-intensive, and (2) the results of semi-supervised learning in one application cannot be reused in another one even in the same domain. In the thesis, we introduce two general frameworks for learning with hidden topics discovered from large-scale data collections: one for clustering and another for classification. Unlike semi-supervised learning, we approach this issue from the point of view of text/web data analysis that is based on recently successful topic analysis models, such as Latent Semantic Analysis, Probabilistic-Latent Semantic Analysis, or Latent Dirichlet Allocation. The underlying idea of the frameworks is that for a domain we collect a very large external data collection called “universal dataset”, and then build the learner on both the original data (like snippets or image captions) and a rich set of hidden topics discovered from the universal data collection. The general frameworks are flexible 2 and general enough to apply for a wide range of domains and languages. Once we analyze a universal dataset, the resulting hidden topics can be used for several learning tasks in the same domain. This is also particularly useful for sparse data mining. Sparse data like snippets returned from a search engine can be expanded and enriched with hidden topics. Thus, a better performance can be achieved. Moreover, because the method can learn with smaller data (the meaningful hidden topics rather than all unlabeled data), it requires less computational resources than semi-supervised learning. Roadmap: The organization of this thesis is follow Chapter 1 reviews some typical topic analysis methods such as Latent Semantic Analysis, Probabilistic Latent Semantic Analysis, and Latent Dirichlet Allocation. These models can be considered the basic building blocks of general framework of probabilistic modeling of text and be used to develop more sophisticated and application-oriented models, such as hierarchical models, author-role models, entity models, and so on. They can also be considered key components in our proposals in subsequent chapters. Chapter 2 introduces two general frameworks for learning with hidden topics: one for classification and one for clustering. These frameworks are flexible and general enough to apply in many domains of applications. The key common phrase between the two frameworks is topic analysis for large-scale collections of web documents. The quality of the hidden topic described in this chapter will much influence the performance of subsequent stages. Chapter 3 summarizes some major issues for analyzing data collections of Vietnamese documents/Web pages. We first review some characteristics of Vietnamese which are considered significant for data preprocessing and transformation in the subsequent processes. Next, we discuss more details about each step of preprocessing and transforming data. Important notes, including specific characteristics of Vietnamese are highlighted. Also, we demonstrate the results from topic analysis using LDA for the clean, preprocessed dataset. Chapter 4 describes the deployments of general frameworks proposed in Chapter 2 for 2 tasks: search result classification, and search result clustering. The two implementations are based on the topic model analyzed from a universal dataset like shown in chapter 3. The Conclusion sums up the achievements throughout the previous four chapters. Some future research topics are also mentioned in this section. 3 Chapter 1. The Problem of Modeling Text Corpora and Hidden Topic Analysis 1.1. Introduction The goal of modeling text corpora and other collections of discrete data is to find short description of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basis tasks such as classification, clustering, summarization, and similarity and relevance judgments. Significant achievements have been made on this problem by researchers in the context of information retrieval (IR). Vector space model [48] (Salton and McGill, 1983) – a methodology successfully deployed in modern search technologies - is a typical approach proposed by IR researchers for modeling text corpora. In this model, documents are represented as vectors in a multidimensional Euclidean space. Each axis in this space corresponds to a term (or word). The i-th coordinate of a vector represents some functions of times of the i-th term occurs in the document represented by the vector. The end result is a term-by-document matrix X whose columns contain the coordinates for each of the documents in the corpus. Thus, this model reduces documents of arbitrary length to fixed- length lists of numbers. While the vector space model has some appealing features – notably in its basis identification of sets of words that are discriminative for documents in the collection – the approach also provides a relatively small amount of reduction in description length and reveals little in the way of inter- or intra- document statistical structure. To overcome these shortcomings, IR researchers have proposed some other modeling methods such as generalized vector space model, topic-based vector space model, etc., among which latent semantic analysis (LSA - Deerwester et al, 1990)[13][26] is the most notably. LSA uses a singular value decomposition of the term-by-document X matrix to identify a linear subspace in the space of term weight features that captures most of variance in the collection. This approach can achieve considerable reduction in large collections. Furthermore, Deerwester et al argue that this method can reveal some aspects of basic linguistic notions such as synonymy or polysemy. In 1998, Papadimitriou et al [40] developed a generative probabilistic model of text corpora to study the ability of recovering aspects of the generative model from data in LSA approach. However, once we have a generative model in hand, it is not clear why we 4 should follow the LSI approach – we can attempt to proceed more directly, fitting the model to data using maximum likelihood or Bayesian methods. The probabilistic LSI (PLSI - Hoffman, 1999) [21] [22] is a significant step in this regard. The pLSI models each word in a document as a sample from a mixture model, where each mixture components are multinomial random variables that can be viewed as representation of “topics”. Consequently, each word is generated from a single topic, and different words in a document may be generated from different topics. Each document is represented as a probability distribution over a fixed set of topics. This distribution can be considered as a “reduced description” associated with the document. While Hofmann’s work is a useful step toward probabilistic text modeling, it suffers from severe overfitting problems. The number of parameters grows linearly with the number of documents. Additionally, although pLSA is a generative model of the documents in the collection it is estimated on, it is not a generative model of new documents. Latent Dirichlet Allocation (LDA) [5][20] proposed by Blei et. al. (2003) is one solution to these problems. Like all of the above methods, LDA bases on the “bag of word” assumption – that the order of words in a document can be neglected. In addition, although less often stated formally, these methods also assume that documents are exchangeable; the specific ordering of the documents in a corpus can also be omitted. According to de Finetti (1990), any collection of exchangeable random variables can be represented as a mixture distribution – in general an infinite mixture. Thus, if we wish to consider exchangeable representations for documents and words, we need to consider mixture models that capture the exchangeability of both words and documents. This is the key idea of LDA model that we will consider carefully in the section 1.3. In recent time, Blei et al have developed the two extensions to LDA. They are Dynamic Topic Models (DTM - 2006)[7] and Correlated Topic Models (CTM - 2007) [8]. DTM is suitable for time series data analysis thanks to the non-exchangeability nature of modeling documents. On the other hand, CTM is capable of revealing topic correlation, for example, a document about genetics is more likely to also be about disease than X-ray astronomy. Though the CTM gives a better fit of the data in comparison to LDA, it is so complicated by the fact that it loses the conjugate relationship between prior distribution and likelihood. In the following sections, we will discuss more about the issues behind these modeling methods with particular attention to LDA – a well-known model that has shown its efficiency and success in many applications. 5 1.2. The Early Methods 1.2.1. Latent Semantic Analysis The main challenge of machine learning systems is to determine the distinction between the lexical level of “what actually has been said or written” and the semantic level of “what is intended” or “what was referred to” in the text or utterance. This problem lies in twofold: (i) polysemy, i.e., a word has multiple meaning and multiple types of usage in different context, and (ii), synonymy and semantically related words, i.e, different words mat have similar sense. They at least in certain context specify the same concept or the same topic in a weaker sense. Latent semantic analysis (LSA - Deerwester et al, 1990) [13][24][26] is the well-known technique which partially addresses this problem. The key idea is to map from the document vectors in word space to a lower dimensional representation in the so-called concept space or latent semantic space. Mathematically, LSA relies on singular value decomposition (SVD), a well-known factorization method in linear algebra. a. Latent Semantic Analysis by SVD In the first step, we present the text corpus as term-by-document matrix where elements (i, j) describes the occurrences of term i in document j. Let X be such a matrix, X will look like this: ⎥⎥ ⎥ ⎦ ⎤ ⎢⎢ ⎢ ⎣ ⎡ → ↓ nmm n xx xx ,1, ,11,1 T i j t d L MOM L Now a row in this matrix is a vector corresponding to a term, giving its relation to each document: [ ]nii xx ,1,Tj t L= Likewise, a column in this matrix will be a vector corresponding to a document, giving its relation to each term: [ ]jmjTj xxd ,,1 L= Now, the dot product between two term vectors gives us the correlation between the terms over the documents. The matrix product pi tt T TXX contains all these dot products. 6 Element (i, p) (which equal to element (p,i) due to the symmetry) contains the dot product . Similarly, the matrix )tt(tt ippTi T= XX T contains the dot products between all the document vectors, giving their correlation over the terms: jTqqTj dddd = In the next step, we conduct the standard SVD for the X matrix and get , where U and V are orthogonal matrices and the diagonal matrix Σ contains the singular values of X. The matrix products giving us the term and document correlations are then become and respectively. TVUX Σ= IVVUU TT == TTT VUXX ΣΣ= TTT UVXX ΣΣ= Since and are diagonal we see that U must contain the eigenvectors of TΣΣ ΣΣT TXX , while V must be the eigenvectors of XX T . Both products have the same non-zero eigenvalues, given by the non-zero entries of TΣΣ , or equally, the non-zero entries of ΣΣT . Now the decomposition looks like this: The values lσσ ,...,1 are called the singular values, and and the left and right singular vectors. Note that only part of U , which contributes to , is the i-th row. Let this row vector be called . Likewise, the only part of that contributes to is the j’th column, . These are not the eigenvectors, but depend on all the eigenvectors. luu ,...,1 lvv ,...,1 it itˆ TV jd jdˆ The LSA approximation of X is computed by selecting k largest singular values, and their corresponding singular vectors from U and V. This results in the rank k approximation to X with the smallest error. The appealing thing in this approximation is that not only does it have the minimal error, but it translates the terms and document vectors into a concept space. The vector then has k entries, each gives the occurrence of term i in one of the k concepts. Similarly, the vector gives the relation between document j and each concept. We write this approximation as . Based on this approximation, we can now do the following: itˆ jdˆ T kkkk VUX Σ= - See how related documents j and q are in the concept space by comparing the vectors jdˆ and qdˆ (usually by cosine similarity). This gives us a clustering of the documents. 7 - Comparing terms i and p by comparing the vectors itˆ and ntˆ , giving us a clustering of the terms in the concept space. - Given a query, view this as a mini document, and compare it to your documents in the concept space. To do the latter, we must first translate your query into the concept space with the same transformation used on the documents, i.e. and . This means that if we have a query vector, we must do the translation before comparing it to the document vectors in the concept space. jkkj dUd ˆΣ= jTkkj dUd 1ˆ −Σ= qUq Tkk 1ˆ −Σ= b. Applications The new concept space typically can be used to: - Compare the documents in the latent semantic space. This is useful to some typical learning tasks such as data clustering or document classification. - Find similar documents across languages, after analyzing a base set of translated documents. - Find relations between terms (synonymy and polysemy). Synonymy and polysemy are fundamental problems in natural language processing: o Synonymy is the phenomenon where different words describe the same idea. Thus, a query in a search engine may fail to retrieve a relevant document that does not contain the words which appeared in the query. o Polysemy is the phenomenon where the same word has multiple meanings. So a search may retrieve irrelevant documents containing the desired words in the wrong meaning. For example, a botanist and a computer scientist looking for the word "tree" probably desire different sets of documents. - Given a query of terms, we could translate it into the concept space, and find matching documents (information retrieval). c. Limitations LSA has two drawbacks: - The resulting dimensions might be difficult to interpret. For instance, in {(car), (truck), (flower)} --> {(1.3452 * car + 0.2828 * truck), (flower)} the (1.3452 * car + 0.2828 * truck) component could be interpreted as "vehicle". However, it is very likely that cases close to 8 } } {(car), (bottle), (flower)} --> {(1.3452 * car + 0.2828 * bottle), (flower)} will occur. This leads to results which can be justified on the mathematical level, but have no interpretable meaning in natural language. - The probabilistic model of LSA does not match observed data: LSA assumes that words and documents form a joint Gaussian model (ergodic hypothesis), while a Poisson distribution has been observed. Thus, a newer alternative is probabilistic latent semantic analysis, based on a multinomial model, which is reported to give better results than standard LSA. 1.2.2. Probabilistic Latent Semantic Analysis Probabilistic Latent Semantic Analysis [21][22] (PLSA) is a statistical technique for analysis of two-mode and co-occurrence data which has applications in information retrieval and filtering, natural language processing, machine learning from text and in related areas. Compared to standard LSA, PLSA is based on a mixture decomposition derived from a latent class model. This results in a more principled approach which has a solid foundation in statistics. a. The Aspect Model Suppose that we have given a collection of text documents with terms from a vocabulary . The starting point for PLSA is a statistical model namely aspect model. The aspect model is a latent variable model for co-occurrence data in which an unobserved variable { NddD ,...,1= { MwwW ,...,1= { }KzzZz ,...,1=∈ is introduced to capture the hidden topics implied in the documents. Here, N, M and K are the number of documents, words, and topics respectively. Hence, we model the joint probability over by the mixture as follows: DxW ∑ ∈ == Zz dzPzwPdwPdwPdPwdP )|()|()|( ),|()(),( (1.1) Like virtually all statistical latent variable models the aspect model relies on a conditional independence assumption, i.e. d and w are independent conditioned on the state of the associated latent variable (the graphical model representing this is demonstrated in Figure 1.1(a)) 9 Figure 1.1. Graphical model representation of the aspect model in the asymmetric (a) and symmetric (b) parameterization. ( [53]) It is necessary to note that the aspect model can be equivalently parameterized by (cf. Figure 1.1 (b)) ∑ ∈ = Zz zwPzdPzPwdP )|()|()(),( (1.2) This is perfectly symmetric with respect to both documents and words. b. Model Fitting with the Expectation Maximization Algorithm The aspect model is estimated by the traditional procedure for maximum likelihood estimation, i.e. Expectation Maximization. EM iterates two coupled steps: (i) an expectation (E) step in which posterior probabilities are computed for the latent variables; and (ii) a maximization (M) step where parameters are updated. Standard calculations give us the E-step formulae ∑ ∈′ ′′′ = Zz zwPzdPzP zwPzdPzPwdzP )|()|()( )|()|()(),|( (1.3) As well as the following M-step equation ∑ ∈ ∝ Dd wdzPwdnzwP ),|(),( )|( (1.4) ∑ ∈ ∝ Ww wdzPwdnzdP ),|(),( )|( (1.5) ∑∑ ∈ ∈ ∝ Dd Ww wdzPwdnzP ),|(),( )( (1.6) c. Probabilistic Latent Semantic Space 10 Let us consider topic-conditional multinomial distribution over vocabulary as points on the )|(. zp 1−M dimensional simplex of all possible multinomial. Via convex hull, the K points define a 1−≤ KL dimensional sub-simplex. The modeling assumption expressedby (1.1) is that conditional distributions for all documents are approximated by a multinomial representable as a convex combination of in which the mixture component uniquely define a point on the spanned sub-simplex which can identified with a concept space. A simple illustration of this idea is shown in )|( dwP )|( zwP )|( dzP Figure 1.2. Figure 1.2. Sketch of the probability sub-simplex spanned by the aspect model ( [53]) In order to clarify the relation to LSA, it is useful to reformulate the aspect model as parameterized by (1.2) in matrix notation. By defining , and ( kiki zdPU ,,)|(ˆ = ) ( )( ) kjkj zwPV , |ˆ = ( )( kkzPdiag=Σˆ ) matrices, we can write the joint probability model P as a matrix product . Comparing this with SVD, we can draw the following observations: (i) outer products between rows of U and reflect conditional independence in PLSA, (ii) the mixture proportions in PLSA substitute the singular values. Nevertheless, the main difference between PLSA and LSA lies on the objective function used to specify the optimal approximation. While LSA uses or Frobenius norm which corresponds to an implicit additive Gaussian noise assumption on counts, PLSA relies on the likelihood function of multinomial sampling and aims at an explicit maximization of the predictive power of the model. As is well known, this corresponds to a minimization of the cross entropy or Kullback - Leibler divergence between empirical distribution and the model, which is very different from the view of any types of squared deviation. On the modeling side, this offers crucial advantages, for example, the mixture approximation TVUP ˆˆˆΣ= ˆ Vˆ 2L P of the term-by-document matrix is a well-defined probability 11 distribution. IN contrast, LSA does not define a properly normalized probability distribution and the approximation of term-by-document matrix may contain negative entries. In addition, there is no obvious interpretation of the directions in the LSA latent space, while the directions in the PLSA space are interpretable as multinomial word distributions. The probabilistic approach can also take advantage of the well-established statistical theory for model selection and complexity control, e.g., to determine the optimal number of latent space dimensions. Choosing the number of dimensions in LSA on the other hand is typically based on ad hoc heuristics. d. Limitations In the aspect model, notice that is a dummy index into the list of documents in the training set. Consequently, d is a multinomial random variable with as many possible values as there are training documents and the model learns the topic mixtures only for those documents on which it is trained. For this reason, pLSI is not a well- defined generative model of documents; there is no natural way to assign probability to a previously unseen document. d )|( dzp A further difficulty with pLSA, which also originate from the use of a distribution indexed by training documents, is that the numbers of parameters grows linearly with the number of training documents. The parameters for a K-topic pLSI model are K multinomial distributions of size V and M mixtures over the K hidden topics. This gives KV + KM parameters and therefore linear growth in M. The linear growth in parameters suggests that the model is prone to overfitting and, empirically, overfitting is indeed a serious problem. In practice, a tempering heuristic is used to smooth the parameters of the model for acceptable predictive performance. It has been shown, however, that overfitting can occur even when tempering is used (Popescul et al., 2001, [41]). Latent Dirichlet Allocation (LDA - which is described in section 1.3. overcomes both of these problems by treating the topic mixture weights as a K-parameter hidden random variable rather than a large set of individual parameters which are explicitly linked to the training set. 1.3. Latent Dirichlet Allocation Latent Dirichlet Allocation (LDA) [7][20] is a generative probabilistic model for collections of discrete data such as text corpora. It was developed by David Blei, Andrew Ng, and Michael Jordan in 2003. By nature, LDA is a three-level hierarchical Bayesian model in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic, in turn, modeled as an infinite mixture over an 12 underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. In the following sections, we will discuss more about generative model, parameter estimation as well as inference in LDA. 1.3.1. Generative Model in LDA Given a corpus of M documents denoted by { }MdddD ,...,, 21= , in which each document number m in the corpus consists of Nm words drawn from a vocabulary of terms , the goal of LDA is to find the latent structure of “topics” or “concepts” which captured the meaning of text that is imagined to be obscured by “word choice” noise. Though the terminology of “hidden topics” or “latent concepts” has been encountered in LSA and pLSA, LDA provides us a complete generative model that has shown better results than the earlier approaches. iw { Vtt ,...,1 } Consider the graphical model representation of LDA as shown in Figure 1.3, the generative process can be interpreted as follows: LDA generates a stream of observable words , partitioned into documentsnmw , md r . For each of these documents, a topic proportion mϑ r is drawn, and from this, topic-specific words are emitted. That is, for each word, a topic indicator is sampled according to the document – specific mixture proportion, and then the corresponding topic-specific term distribution nmz , nmz , ϕr used to draw a word. The topics kϕr are sampled once for the entire corpus. The complete (annotated) generative model is presented in Figure 1.4. Figure 1.5 gives a list of all involved quantities. Figure 1.3. Graphical model representation of LDA - The boxes is “plates” representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document [20] 13 Figure 1.4. Generative model for Latent Dirichlet allocation; Here, Dir, Poiss and Mult stand for Dirichlet, Poisson, Multinomial distributions respectively. Figure 1.5. Quantities in the model of latent Dirichlet allocation 1.3.2. Likelihood According to the model, the probability that a word instantiates a particular term t given the LDA parameters is: nmw , ∑ = ===Φ= K k mnmknmmnm kzptwptwp 1 ,,, )|()|(),|( ϑϕϑ rrr (1.7) which corresponds to one iteration on the word plate of the graphical model. From the topology of the graphical model, we can further specify the complete-data likelihood of a 14 document, i.e., the joint distribution of all known and hidden variables given the hyper parameters: 43421 r 4444444 84444444 76 rr 44444 344444 21 rrrrrr plate topic document) (1 platedocument plate word , 1 , )|(.),(.)|()|(),|,,,( , βαϑϑϕβαϑ Φ=Φ ∏= ppzpwpzdp mmnmz N n nmmmm nm m (1.8) Specifying this distribution is often simple and useful as a basic for other derivations. So we can obtain the likelihood of a document md r , i.e., of the joint event of all word occurrences, as one of its marginal distributions by integrating out the distributions mϑ r and Φ and summing over : nmz , ( )∫∫ ∏∑ = ΦΦ= m nm nm N n z mmnmznmmm ddzpwpppdp 1 ,, , , |)|().|().|(),|( ϑϑϕβαϑβα rrrrrrrrr (1.9) ( ) ∏∫∫ = ΦΦΦ= m N n mmnmm ddwppp 1 , ),|().|(.| ϑϑβαϑ rrrrr (1.10) Finally, the likelihood of the complete corpus { }MmmdW 1== r is determined by the product of the likelihoods of the independent documents: ∏ = = M m mdpWp 1 ).,|(),|( βαβα rrrrr (1.11) 1.3.3. Parameter Estimation and Inference via Gibbs Sampling Exact estimation for LDA is generally intractable. The common solution to this is to use approximate inference algorithms such as mean-field variational expectation maximization, expectation propagation, and Gibbs sampling [20] a. Gibbs Sampling Gibbs sampling is a special case of Markov-chain Monte Carlo (MCMC) and often yields relatively simple algorithms for approximate inference in high-dimensional models such as LDA. Through the stationary behavior of a Markov chain, MCMC methods can emulate high-dimensional probability distributions )(xp r . This means that one sample is generated for each transition in the chain after a stationary state of the chain has been reached, which happens after a so-called “burn-in period” that eliminates the influence of initialization parameters. In Gibbs sampling, the dimensions of the distribution are ix 15 r sampled alternately one at a time, conditioned on the values of all other dimensions, which we denote . The algorithm works as follows: ix− 1. Choose dimension i (random or by permutation) 2. Sample ix from )|( ii xx p − r To build a Gibbs sampler, the full conditionals )|( ii xxp − r must be calculated, which is possible using ( ) ( ) { iiiii xxx dxxp xpxxp −− == ∫ rr r } rr , with )|( (1.12) For models that contain hidden variables zr , their posterior given the evidence, ( )xzp rr | , is a distribution commonly wanted. With Eq. 1.12, the general formulation of a Gibbs sampler for such latent-variable models becomes: ( ) ( )( )∫=− Z iii dzxzp xzpxzzp rr rrrr , ,,| (1.13) where the integral changes to a sum for discrete variables. With a sufficient number of samples [ Rrzr ,1, ]~ ∈r , the latent-variable posterior can be approximated using: ( ) ( )∑ = −≈ R r rzzR xzp 1 ~1| rrrr δ (1.14) With the Kronecker delta ( ) { }otherwise 0 ;0 if 1 == uu rrδ b. Parameter Estimation Heirich et al [20] has applied the above hidden-variable method to develop a Gibbs sampler for LDA as shown in Figure 1.6. The hidden variables here are , i.e., the topics that appear with the words of the corpus . We do not need to include the parameter sets nmz , nmw , Θ and Φbecause they are just the statistics of the associations between the observed and the corresponding , the state variables of the Markov chain. nmw , nmz , Heirich [20] has been shown a sequence of calculations to lead to the formulation of the full conditionals for LDA as follows: ( ) ( ) ( ) ( ) 11 ,| 1 , 1 , −⎥⎦ ⎤⎢⎣ ⎡ + + −⎥⎦ ⎤⎢⎣ ⎡ + += ∑∑ = − = − − z K z z m z z im v V v v z t t iz ii n n n n wzzp α α β βrr (1.15) 16 The other hidden variables of LDA can be calculated as follows: ( ) ( ) v V v v k t t k tk n n β βϕ + += ∑ =1 , (1.16) ( ) ( ) z K z z m k k m km n n α αϑ + += ∑ =1 , (1.17) - Initialization zero all count variables, , ,( )zmn mn ( )tzn , zn for all documents do [ ]Mm ,1∈ for all words in document do [ mNn ,1∈ ] m sample topic index ~Mult(1/K) nmz , increment document-topic count: ( ) 1+smn increment document-topic sum: 1+mn increment topic-term count: ( ) 1+tsn increment topic-term sum: 1+zn end for end for ] - Gibbs sampling over burn-in period and sampling period while not finished do for all documents do [ ]Mm ,1∈ for all words in document do [ mNn ,1∈ m - for the current assignment of to a term t for word : z nmw , decrement counts and sums: ( ) 1−zmn ; 1−mn ; ( ) 1−tzn ; 1−zn - multinomial sampling acc. To Eq. 1.15 (decrements from previous step): sample topic index ( )wzzpz ii rr ,|~~ − - use the new assignment of to the term t for word to: z nmw , increment counts and sums: ( ) 1+zmn r ; ;1+tznr 1+znr end for end for - check convergence and read out parameters if converged and L sampling iterations since last read out then - the different parameters read outs are averaged read out parameter set Φ acc. to Eq. 1.16 read out parameter set Θ acc. to Eq. 1.17 end if end while Figure 1.6. Gibbs sampling algorithm for Latent Dirichlet Allocation. 17 c. Inference Given an estimated LDA model, we can now do topic inference for unknown documents by a similar sampling procedure. A new document m~ is a vector of words , our goal is to estimate the posterior distribution of topics wr ~zr given the word vector of the query wr and the LDA model ( ) ( ) ),,~,~(,|:, zwwzpLwzpL rrrrrr =ΦΘ . In order to find the required counts for a complete new document, the similar reasoning is made to get the Gibbs sampling update: ( ) ( ) ( ) ( ) ( ) ( ) ( ) 11~ ~ ,;,~|~ 1 ~ , ~ 1 , −⎥⎦ ⎤⎢⎣ ⎡ + + −⎥⎦ ⎤⎢⎣ ⎡ ++ ++== ∑∑ = − = − −− z K z z m k ik m v v k V v v k t t ik t k iii n n nn nn wzwzkzp α α β βrrr (1.18) where the new variable counts the observations of term t and topic k in the unseen document. This equation gives a colorful example of the workings of Gibbs posterior sampling: High estimated word-topic associations ( )t kn r ( )t kn will dominate the multinomial masses compared to the contributions of ( )tkn~ and ( )kmn ~ , which are chosen randomly. Consequently, on repeatedly sampling from the distribution and updating of ( )kmn ~ , the masses of topic-word associations are propagated into document-topic associations. Note the smoothing influence of the Dirichlet hyper parameters. Applying Eq. 1.17 gives the topic distribution for the unknown document: ( ) ( ) z K z z m k k m km n n α αϑ + += ∑ =1 ~ ~ , (1.19) 1.3.4. Applications LDA has been successfully applied to text modeling and feature reduction in text classification [5]. Recent work has also used LDA as a building block in more sophisticated topic models such as author-document models [42], abstract-reference models [15], syntax-semantic models [18] and image-caption models [6]. Additionally, the same kinds of modeling tools have been used in a variety of non-text settings, such as image processing [46], and the modeling of user profiles [17]. 1.4. Summary This chapter has shown some typical topic analysis methods such as LSA, PLSA, and LDA. These models can be considered the basic building blocks of general framework of probabilistic modeling of text and be used to develop more sophisticated and application- 18 oriented models. These models can also be seen as key components in our proposals in subsequent chapters. Among the topic analysis methods, we pay much attention to LDA, a generative probabilistic model for collections of discrete data such as text corpora. It was developed by David Blei, Andrew Ng, and Michael Jordan in 2003 and has proven its success in many applications. Given the data, the goal is to reverse the generative process to estimate model parameters. However, exact inference or estimation for even the not-so-complex model like LDA is intractable. Consequently, there are a lot of attempts to make use of approximate approaches to this task among which Gibbs Sampling is one of the most suitable methods. Gibbs Sampling, which is also mentioned in this chapter, is a special case of Markov-chain Monte Carlo (MCMC) and often yields relatively simple algorithms for approximate inference in high-dimensional models like LDA. 19 Chapter 2. Frameworks of Learning with Hidden Topics 2.1. Learning with External Resources: Related Works In recent time, there were a lot of attempts making use of external resources to enhance learning performance. Depending on types of external resources, these methods can be roughly classified into 2 categories: those make use of unlabeled data, and those exploit structured or semi-structured data. The first category is commonly referred under the name of semi-supervised learning. The key argument is that unlabeled examples are significantly easier to collect than labeled ones. One example of this is web-page classification. Suppose that we want a program to electronically visit some web site and download all the web pages of interest to us, such as all the Computer Science faculty pages, or all the course home pages at some university. To train such a system to automatically classify web pages, one would typically rely on hand labeled web pages. Unfortunately, these labeled examples are fairly expensive to obtain because they require human effort. In contrast, the web has hundreds of millions of unlabeled web pages that can be inexpensively gathered using a web crawler. Therefore, we would like the learning algorithms to be able to take as much advantage of the unlabeled data as possible. Semi-supervised learning has been received a lot of attentions in the last decade. Yarowsky (1995) uses self-training for word sense disambiguation, e.g. deciding whether the word “plant” means a living organism or a factory in a given context. Rosenberg et. al (2005) apply it to object detection systems from images, and show the semi-supervised technique compares favorably with a state-of-the-art detector. In 2000, Nigam and Ghani [30] perform extensive empirical experiments to compare co-training with generative mixture models and Expectation Maximization (EM). Jones (2005) used co-training, co- EM and other related methods for information extraction from text. Besides, there were a lot of works that applied Transductive Support Vector Machines (TSVMs) to use unlabeled data for determining optimal decision boundary. The second category covers a lot of works exploiting resources like Wikipedia to support learning process. Gabrilovich et. al. (2007) [16] has demonstrated the value of using Wikipedia as an additional source of features for text classification and determining the semantic relatedness between texts. Banerjee et al (2007)[3] also extract titles of Wikipedia articles and use them as features for clustering short texts. Unfortunately, this approach is not very flexible in the sense that it depends much on the external resource or the application. 20 This chapter describes frameworks for learning with the support of topic model estimated from a large universal dataset. This topic model can be considered background knowledge for the domain of application. It also helps the learning process to capture hidden topics (of the domain), the relationships between topics and words as well as words and words, thus partially overcome the limitations of different word choices in text. 2.2. General Learning Frameworks This section presents general frameworks for learning with the support of hidden topics. The main motivation is how to gain benefits from huge sources of online data in order to enhance quality of the Text/Web clustering and classification. Unlike previous studies of learning with external resources, we approach this issue from the point of view of text/Web data analysis that is based on recently successful latent topic analysis models like LSA, pLSA, and LDA. The underlying idea of the frameworks is that for each learning task, we collect a very large external data collection called “universal dataset”, and then build a learner on both the learning data and a rich set of hidden topics discovered from that data collection. 2.2.1. Frameworks for Learning with Hidden Topics Corresponding to two typical learning problems, i.e. classification and clustering, we describe two frameworks with some differences in the architectures. a. Framework for Classification Figure 2.1. Classification with Hidden Topics Nowadays, the continuous development of Internet has created a huge amount of documents which are difficult to manage, organize and navigate. As a result, the task of automatic classification, which is to categorize textual documents into two or more predefined classes, has been received a lot of attentions. 21 Several machine-learning methods have been applied to text classification including decision trees, neural networks, support vector machines, etc. In the typical applications of machine-learning methods, the training data is passed to a learning phrase. The result of the learning step is an appropriate classifier capable of categorizing new documents. However, in the cases such as the training data is not as much as expected or the data to be classified is too rare [52], learning with only training data can not provide us a satisfactory classifier. Inspired by this fact, we propose a framework that enables us to enrich both training and new coming data with hidden topics from available large dataset so as to enhance the performance of text classification. Classification with hidden topics is described in Figure 2.1. We first collect a very large external data collection called “universal dataset”. Next, a topic analysis technique such as pLSA, LDA, etc. is applied to the dataset. The result of this step is an estimated topic model which consists of hidden topics and the probability distributions of words over these topics. Upon this model, we can do topic inference for training dataset and new data. For each document, the output of topic inference is a probability distribution of hidden topics – the topics analyzed in the estimation phrase – given the document. The topic distributions of training dataset are then combined with training dataset itself for learning classifier. In the similar way, new documents, which need to be classified, are combined with their topic distributions to create the so called “new data with hidden topics” before passing to the learned classifier. b. Framework for Clustering Figure 2.2. Clustering with Hidden Topics Text clustering is to automatically generate groups (clusters) of documents based on the similarity or distance among documents. Unlike Classification, the clusters are not known 22 previously. User can optionally give the requirement about the number of clusters. The documents will later be organized in clusters, each of which contains “close” documents. Clustering algorithms can be hierarchical or partitional. Hierarchical algorithms find successive clusters using previously established clusters, whereas partitional algorithms determine all clusters at once. Hierarchical algorithms can be agglomerative (“bottom- up”) or divisive (“top-down”). Agglomerative algorithms begin with each element as a separate cluster and merge them into larger ones. Divisive algorithm begin with the while set and divide it into smaller ones. Distance measure, which determines how similarity of two documents is calculated, is a key to the success of any text clustering algorithms. Some documents may be close to one another according to one distance and further away according to another. Common distance functions are the Euclidean distance, the Manhattan distance (also called taxicab norm or 1-norm) and the maximum norm, just to name here but a few. Web clustering, which is a type of text clustering specific for web pages, can be offline or online. Offline clustering is to cluster the whole storage of available web documents and does not have the constraint of response time. In online clustering, the algorithms need to meet the “real-time condition”, i.e. the system need to perform clustering as fast as possible. For example, the algorithm should take the document snippets instead of the whole documents as input since the downloading of original documents is time- consuming. The question here is how to enhance the quality of clustering for such document snippets in online web clustering. Inspired by the fact those snippets are only small pieces of text (and thus poor in content) we propose the framework to enrich them with hidden topics for clustering (Figure 2.2). This framework and topic analysis is similar to one for classification. The difference here is only due to the essential differences between classification and clustering. 2.2.2. Large-Scale Web Collections as Universal Dataset Despite of the obvious differences between two learning frameworks, there is a key phrase sharing between them – the phrase of analyzing topics for previously collected dataset. Here are some important considerations for this phrase: - The degree of coverage of the dataset: the universal dataset should be large enough to cover topics underlined in the domain of application. - Preprocessing: this step is very important to get good analysis results. Although there is no general instruction for all languages, the common advice is to remove as much as possible noise words such as functional words, stop words or too frequent/ rare words. 23 - Methods for topic analysis: Some analyzing methods which can be applied have been mentioned in the Chapter 1. The tradeoff between the quality of topic analysis and time complexity should be taken into account. For example, topic analysis for snippets in online clustering should be as short as possible to meet the “real-time” condition. 2.3. Advantages of the Frameworks - The general frameworks are flexible and general enough to apply in any domain/language. Once we have trained a universal dataset, its hidden topics could be useful for several learning tasks in the same domain. - This is particularly useful for sparse data mining. Spare data like snippets returned from a search engine could be enriched with hidden topics. Thus, enhanced performance can be achieved. - Due to learning with smaller data, the presented methods require less computational resources than semi-supervised learning. - Thank to the nice generative model for analyzing topics for new documents (in the case of LDA), we have a natural way to map documents from term space into topic space. This is really an advantage over heuristic-based mapping in the previous approaches [16][3][10]. 2.4. Summary This chapter described two general frameworks and their advantages for learning with hidden topics: one for classification and one for clustering. The main advantages of our frameworks are that they are flexible and general to apply in any domain/language and be able to deal with sparse data. The key common phrase between the two frameworks is topic analysis for large-scale web collection called “universal dataset”. The quality of the topic model estimation for this data will influence much the performance of learning in the later phrases. 24 Chapter 3. Topics Analysis of Large-Scale Web Dataset As mentioned earlier, topic analysis for a universal dataset is a key to the success of our proposed methods. Thus, toward Vietnamese text mining, this chapter contributes to considerations for the problem of topics analysis for large-scale web datasets in Vietnamese. 3.1. Some Characteristics of Vietnamese Vietnamese is the national and official language of Vietnam [48]. It is the mother tongue of the Vietnamese people who constitute 86% of Vietnam’s population, and of about three million overseas Vietnamese. It is also spoken as a second language by some ethnic minorities of Vietnam. Many words in Vietnamese are borrowed from Chinese. Originally, it is written in Chinese-like writing system. The current writing system of Vietnamese is a modification of Latin alphabet, with additional diacritics for tones and certain letters. 3.1.1. Sound a. Vowels Like other Southeast Asian languages, Vietnamese has a comparatively large number of vowels. Below is a vowel chart of vowels in Vietnamese: Table 3.1. Vowels in Vietnamese The correspondence between the orthography and pronunciation is rather complicated. For example, the vowel i is often written as y; both may represent [i], in which case the difference is in the quality of the preceding vowel. For instance, “tai” (ear) is [tāi] while tay (hand/arm) is [tāj]. In addition to single vowels (or monophthongs), Vietnamese has diphthongs (âm đôi). Three diphthongs consist of a vowel plus a. These are: “ia”, “ua”, “ưa” (When followed by a consonant, they become “iê”, “uô”, and “ươ”, respectively). The other diphthongs 25 consist of a vowel plus semivowel. There are two of these semivowels: y (written i or y). A majority of diphthongs in Vietnamese are formed this way. Furthermore, these semivowels may also follow the first three diphthongs (“ia”, “ua”, “ưa”) resulting in tripthongs. b. Tones Vietnamese vowels are all pronounced with an inherent tone. Tones differ in pitch, length, contour melody, intensity, and glottal (with or without accompanying constricted vocal cords) Tone is indicated by diacritics written above or below the vowel (most of the tone diacritics appear above the vowel; however, the “nặng” tone dot diacritic goes below the vowel). The six tones in Vietnamese are: Table 3.2. Tones in Vietnamese c. Consonants The consonants of the Hanoi variety are listed in the Vietnamese orthography, except for the bilabial approximant which is written as “w” (in the writing system it is written the same as the vowels “o” and “u” Some consonant sounds are written with only one letter (like “p”), other consonant sounds are written with a two-letter digraph (like “ph”), and others are written with more than one letter or digraph (the velar stop is written variously as “c”, “k”, or “q”). 26 Table 3.3. Consonants of hanoi variety 3.1.2. Syllable Structure Syllables are elementary units that have one way of pronunciation. In documents, they are usually delimited by white-space. In spite of being the elementary units, Vietnamese syllables are not undividable elements but a structure. Table 3.4 depicts the general structure of Vietnamese syllable: Table 3.4. Structure of Vietnamese syllables TONE MARK Rhyme First Consonant Secondary Consonant Main Vowel Last Consonant Generally, each Vietnamese syllable has all five parts: first consonant, secondary vowel, main vowel, last consonant and a tone mark. For instance, the syllable “tuần” (week) has a tone mark (grave accent), a first consonant (t), a secondary vowel (u), a main vowel (â) and a last consonant (n). However, except for main vowel that is required for all syllables, the other parts may be not present in some cases. For example, the syllable “anh” (brother) has no tone mar, no secondary vowel and no first consonant. Another example is the syllable “hoa” (flower) has a secondary vowel (o) but no last consonant. 3.1.3. Vietnamese Word Vietnamese is often erroneously considered to be a "monosyllabic" language. It is true that Vietnamese has many words that consist of only one syllable; however, most words indeed contain more than one syllable. Based on the way of constructing words from syllables, we can classify them into three classes: single words, complex words and reduplicative words. Each single word has only 27 one syllable that implies specific meaning. For example: “tôi” (I), “bạn” (you), “nhà” (house), etc. Words that involve more than one syllable are called “complex word”. The syllables in complex words are combined based on semantic relationships which are either coordinated (“bơi lội” – swim) or “principle and accessory” (“đường sắt” –railway). A word is considered as a reduplicative word if its syllables have phonic components (Table 3.4) reduplicated, for instance: “đùng đùng” (full-reduplicative), “lung linh” (first consonant reduplicated), etc. This type of words is usually used for scene or sound descriptions particularly in the literary. 3.2. Preprocessing and Transformation Data preprocessing and Transformation are necessary steps for any data mining process in general and for hidden topics mining in particular. After these steps, data is clean, complete, reduced, partially free of noises, and ready to be mined. The main steps for our preprocessing and transformation are described in the subsequent sections and shown in the following chart: Figure 3.1. Pipeline of Data Preprocessing and Transformation 3.2.1. Sentence Segmentation Sentence segmentation is to determine whether a ‘sentence delimiter’ is really a sentence boundary. Like English, sentence delimiters in Vietnamese are full-stop, the exclamation mark and the question mark (.!?). The exclamation mark and the question mark do not really pose the problems. The critical element is again the period: (1) the period can be a sentence-ending character (full stop); (2) the period can denote an abbreviation; (3) the period can used in some expressions like URL, Email, numbers, etc.; (4) in some cases, a period can assume both (1) and (2) functions. Given an input string, the result of this detector are sentences, each of which is in one line. Then, this output is shifted to the sentence tokenization step. 28 3.2.2. Sentence Tokenization Sentence tokenization is the process of detaching marks from words in a sentence. For example, we would like to detach “,” from its previous word. 3.2.3. Word Segmentation As mentioned in Section 3.1. , Vietnamese words are not always determined by white- spaces due to the fact that each word can contain more than one syllable. This gives birth to the task of word segmentation, i.e. segment a sentence into a sequence of words. Vietnamese word segmentation is a perquisite for any further processing and text mining. Though being quite basic, it is not a trivial task because of the following ambiguities: - Overlapping ambiguity: String αβγ were called overlapping ambiguity when both αβ and βγ are valid Vietnamese word. For example: “học sinh học sinh học” (Student studies biology) Î “học sinh” (student) and “sinh học” (biology) are found in Vietnamese dictionary. - Combination ambiguity: String αβγ were called combination ambiguity when ( ),, αββα are possible choices. For instance: “bàn là một dụng cụ” (Table is a tool) Î “bàn” (Table), “bàn là” (iron), “là” (is) are found in Vietnamese dictionary. In this work, we used Conditional Random Fields approach to segment Vietnamese words[31] . The outputs of this step are sequences of syllables joined to form words. 3.2.4. Filters After word segmentation, tokens now are separated by white-space. Filters remove trivial tokens for analyzing process, i.e. tokens for number, date/time, too-short tokens (length is less than 2 characters). Too short sentences, English sentences, or Vietnamese sentences without tones (The Vietnamese sometimes write Vietnamese text without tone) also should be filtered or manipulated in this phrase. 3.2.5. Remove Non Topic-Oriented Words Non Topic-Oriented Words are those we consider to be trivial for topic analyzing process. These words can cause much noise and negative effects for our analysis. Here, we treat functional words, too rare or too common words as non topic-oriented words. See the following table for more details about functional words in Vietnamese: 29 Table 3.5. Functional words in Vietnamese Part of Speech (POS) Examples Classifier Noun cái, chiếc, con, bài, câu, cây, tờ, lá, việc Major/Minor conjunction Bởi chưng, bởi vậy, chẳng những, … Combination Conjunction Cho, cho nên, cơ mà, cùng, dẫu, dù, và Introductory word Gì, hẳn, hết, … Numeral Nhiều, vô số, một, một số, … Pronoun Anh ấy, cô ấy, … Adjunct Sẽ, sắp sửa, suýt, … 3.3. Topic Analysis for VnExpress Dataset We collect a large dataset from VnExpress [47] using Nutch [36] and then do preprocessing and transformation. The statistics of the topics assigned by humans and other parameters of the dataset are shown in the tables below: Table 3.6. Statistics of topics assigned by humans in VnExpress Dataset Society: Education, Entrance Exams, Life of Youths … International: Analysis, Files, Lifestyle … Business: Business man, Stock, Integration … Culture: Music, Fashion, Stage – Cinema … Sport: Football, Tennis Life: Family, Health … Science: New Techniques, Natural Life, Psychology And Others … Note that information about topics assigned by humans is just listed here for reference and not used in the topic analysis process. After data preprocessing and transformation, we get 53M data (40,268 documents, 257,533 words; vocabulary size of 128,768). This data is put into GibbLDA++ [38] – a tool for Latent Dirichlet Allocation using Gibb Sampling (see Section 1.3. ). The results of topic analysis with K = 100 topics are shown in 30 Table 3.5. Table 3.7. Statistics of VnExpress dataset After removing html, doing sentence and word segmentation: size ≈219M, number of docs = 40,328 After filtering and removing non-topic oriented words: size ≈53M, number of docs = 40,268 number of words = 5,512,251; vocabulary size = 128,768 Table 3.8 Most likely words for sample topics. Here, we conduct topic analysis with 100 topics. Topic 1 Topic 3 Topic 7 Tòa (Court) 0.0192 Điều tra (Investigate) 0.0180 Luật sư (Lawyer) 0.0162 Tội (Crime) 0.0142 Tòa án (court) 0.0108 Kiện (Lawsuits) 0.0092 Buộc tội (Accuse) 0.0076 Xét xử (Judge) 0.0076 Bị cáo (Accused) 0.0065 Phán quyết (Sentence) 0.0060 Bằng chứng (Evidence) 0.0046 Thẩm phán (Judge) 0.0050 Trường (School) 0.0660 Lớp (Class) 0.0562 Học sinh (Pupil) 0.0471 Giáo dục (Education) 0.0192 Dạy (Teach) 0.0183 Giáo viên (Teacher) 0.0179 Môn (Subject) 0.0080 Tiểu học (Primary school)0.0070 Hiệu trưởng (Rector) 0.0067 Trung học (High school) 0.0064 Tốt nghiệp (Graduation) 0.0063 Năm học (Academic year)0.0062 Game 0.0869 Trò chơi (Game) 0.0386 Người chơi (Gamer) 0.0211 Nhân vật (Characters) 0.0118 Online 0.0082 Giải trí (Entertainment) 0.0074 Trực tuyến (Online) 0.0063 Phát hành (Release) 0.0055 Điều khiển (Control) 0.0052 Nhiệm vụ (Mission) 0.0041 Chiến đấu (Fight) 0.0038 Phiên bản (Version) 0.0038 Topic 9 Topic 14 Topic 15 Du lịch (Tourism) 0.0542 Khách (Passengers) 0.0314 Khách sạn (Hotel) 0.0276 Du khách (Tourists) 0.0239 Tour 0.0117 Tham quan (Visit) 0.0097 Biển (Sea) 0.0075 Chuyến đi (Journey) 0.0050 Giải trí (Entertainment) 0.0044 Khám phá (Discovery) 0.0044 Lữ hành (Travel) 0.0039 Điểm đến (Destination) 0.0034 Thời trang (Fashion) 0.0482 Người mẫu (Models) 0.0407 Mặc (Wear) 0.0326 Mẫu (Sample) 0.0305 Trang phục (Clothing) 0.0254 Đẹp (Nice) 0.0249 Thiết kế (Design) 0.0229 Sưu tập (Collection) 0.0108 Váy (Skirt) 0.0105 Quần áo (Clothes) 0.0092 Phong cách (Styles) 0.0089 Trình diễn (Perform) 0.0051 Bóng đá (Football) 0.0285 Đội (Team) 0.0273 Cầu thủ (Football Players)0.0241 HLV (Coach) 0.0201 Thi đấu (Compete) 0.0197 Thể thao (Sports) 0.0176 Đội tuyển (Team) 0.0139 CLB (Club) 0.0138 Vô địch (Championship) 0.0089 Mùa (Season) 0.0063 Liên đoàn (Federal) 0.0056 Tập huấn (Training) 0.0042 3.4. Topic Analysis for Vietnamese Wikipedia Dataset The second dataset is collected from Vietnamese Wikipedia, and contains D=29,043 documents. We preprocessed this dataset in the same way described in the Section 3.2. This led to a vocabulary size of V = 63,150, and a total of 4,784,036 word tokens. In the 31 hidden topic mining phrase, the number of topics K was fixed at 200. The hyper- parameters α and β were set at 0.25 and 0.1 respectively. Table 3.9. Statistic of Vietnamese Wikipedia Dataset After removing html, doing sentence and word segmentation: size ≈270M, number of docs = 29,043 After filtering and removing non-topic oriented words: size ≈48M, number of docs = 17,428 number of words = 4,784,036; vocabulary size = 63,150 Table 3.10 Most likely words for sample topics. Here, we conduct topic analysis with 200 topics Topic 2 Topic 5 Topic 6 Tàu (Ship) 0.0527 Hải quân (Navy) 0.0437 Hạm đội (Fleet) 0.0201 Thuyền (Ship) 0.0100 Đô đốc (Admiral) 0.0097 Tàu chiến (Warship) 0.0092 Cảng (Harbour) 0.0086 Tấn công (Attack) 0.0081 Lục chiến (Marine) 0.0075 Thủy quân (Seaman) 0.0067 Căn cứ (Army Base) 0.0066 Chiến hạm (Gunboat) 0.0058 Độc lập (Independence) 0.0095 Lãnh đạo (Lead) 0.0088 Tổng thống (President) 0.0084 Đất nước (Country) 0.0070 Quyền lực (Power) 0.0069 Dân chủ (Democratic) 0.0068 Chính quyền (Government)0.0067 Ủng hộ (Support) 0.0065 Chế độ (System) 0.0063 Kiểm soát (Control) 0.0058 Lãnh thổ (Territory) 0.0058 Liên bang (Federal) 0.0051 Động vật (Animal) 0.0220 Chim (Bird) 0.0146 Lớp (Class) 0.0123 Cá sấu (Crocodiles) 0.0116 Côn trùng (Insect) 0.0113 Trứng (Eggs) 0.0093 Cánh (Wing) 0.0092 Vây (Fin) 0.0077 Xương (Bone) 0.0075 Phân loại (Classify) 0.0054 Môi trường (Environment)0.0049 Xương sống (Spine) 0.0049 Topic 8 Topic 9 Topic 17 Nguyên tố (Element) 0.0383 Nguyên tử (Atom) 0.0174 Hợp chất (Compound) 0.0172 Hóa học (Chemical) 0.0154 Đồng vị (Isotope) 0.0149 Kim loại (Metal) 0.0148 Hidro (Hidro) 0.0142 Phản ứng (Reaction) 0.0123 Phóng xạ (Radioactivity) 0.0092 Tuần hoàn (Circulation) 0.0086 Hạt nhân (Nuclear) 0.0078 Điện tử (Electronics) 0.0076 Trang (page) 0.0490 Web (Web) 0.0189 Google (Google) 0.0143 Thông tin (information) 0.0113 Quảng cáo(advertisement)0.0065 Người dùng(user) 0.0058 Yahoo (Yahoo) 0.0054 Internet (Internet) 0.0051 Cơ sở dữ liệu (database) 0.0044 Rss (RSS) 0.0041 HTML (html) 0.0039 Dữ liệu (data) 0.0038 Lực (Force) 0.0487 Chuyển động (Move) 0.0323 Định luật (Law) 0.0289 Khối lượng (Mass) 0.0203 Quy chiếu (Reference) 0.0180 Vận tốc (Velocity) 0.0179 Quán tính (Inertia) 0.0173 Vật thể (Object) 0.0165 Newton (Newton) 0.0150 Cơ học (Mechanics) 0.0149 Hấp dẫn (Attractive) 0.0121 Tác động (Influence) 0.0114 3.5. Discussion The hidden topics analysis using LDA for both VnExpress and Vietnamese Wikipedia datasets have shown satisfactory results. While VnExpress dataset is more suitable for daily life topic analysis, Vietnamese Wikipedia dataset is good for scientific topic 32 modeling. The decision of which one is suitable for a task depends much on its domain of application. From experiments, it can be seen that the number of topics should be appropriate to the nature of dataset and the domain of application. If we choose a large number of topics, the analysis process can generate a lot of topics which are too close (in the semantic) to each others. On the other hand, if we assign a small number of topics, the results can be too common. Hence, the learning process can benefits less from this topic information. When conducting topic analysis, one should consider data very carefully. Preprocessing and transformation are important steps because noise words can cause negative effects. In Vietnamese, focus should be made on word segmentation, stop words filter. Also, common personal names in Vietnamese should be removed. In other cases, it is necessary to either remove all Vietnamese sentences written without tones (this writing style is quite often in online data in Vietnamese) or do tone recovery for them. Other considerations also should be made for Vietnamese Identification or Encoding conversions, etc., due to the complex variety of online data. 3.6. Summary This chapter summarized major issues for topics analysis of 2 specific datasets in Vietnamese. We first reviewed some characteristics in Vietnamese. These considerations are significant for dataset preprocessing and transformation in the subsequent processes. We then described each step of preprocessing and transforming data. Significant notes, including specific characteristics of Vietnamese, are also highlighted. In the last part, we demonstrated the results from topics analysis using LDA for some dataset in Vietnamese. The results showed that LDA is a potential method for topics analysis in Vietnamese. 33 Chapter 4. Deployments of General Frameworks This chapter goes further into details of the deployments of general frameworks for the two tasks: classification and clustering for Vietnamese Web Search Results. Evaluation and Analysis for our proposals are also considered in the next subsections. 4.1. Classification with Hidden Topics 4.1.1. Classification Method Figure 4.1. Classification with VnExpress topics The objective of classification is to automatically categorize new coming documents into one of k classes. Given a moderate training dataset, an estimated topic model and k classes, we would like to build a classifier based on the framework in Figure 4.1. Here, we use the model estimated from VnExpress dataset with LDA (see section 3.3. for more details). In the following subsections, we will discuss more about important issues of this deployment. a. Data Description For training and testing data, we first submit queries to Google and get results through Google API [19]. The number of query phrases and snippets in each train and test dataset are shown in Table 4.1 Google search results as training and testing dataset. The search phrases for training and test data are designed to be exclusive. Note that, the training and testing data here are designed to be as exclusive as possible. b. Combining Data with Hidden Topics The outputs of topic inference for train/new data are topic distributions, each of which corresponds to one snippet. We now have to combine each snippet with its hidden topics. 34 This can be done by a simple procedure in which the occurrence frequency of a topic in the combination depends on its probability. For example: a topic with probability greater than 0.03 and less than 0.05 have 2 occurrences, while a topic with probability less than 0.01 is not included in the combination. One demonstrated example is shown in Figure 4.2. Table 4.1 Google search results as training and testing dataset. The search phrases for training and test data are designed to be exclusive Training dataset Testing dataset Domains #phrases #snippets #phrases #snippets Business 50 1.479 9 270 Culture-Arts 49 1.350 10 285 Health 45 1.311 8 240 Laws 52 1.558 10 300 Politics 32 957 9 270 Science – Education 41 1.229 9 259 Life-Society 19 552 8 240 Sports 45 1.267 9 223 Technologies 51 1.482 9 270 c. Maximum Entropy Classifier The motivating idea behind maximum entropy [34][35] is that one should prefer the most uniform models that also satisfy any given constraints. For example, consider a four-class text classification task where we told only that on average 40% documents with the word “professor” in them are in the faculty class. Intuitively, when given a document with “professor” in it, we would say it has a 40% chance of being a faculty document, and a 20% chance for each of the other three classes. If a document does not have “professor” we would guess the uniform class distribution, 25% each. This model is exactly the maximum entropy model that conforms to our known constraint. Although maximum entropy can be used to estimate any probability distribution, we only consider here the classification task; thus we limit the problem to learning conditional distributions from labeled training data. Specifically, we would like to learn the conditional distribution of the class label given a document. 35 Figure 4.2 Combination of one snippet with its topics: an example Constraints and Features In maximum entropy, training data is used to set constraints on the conditional distribution. Each constraint shows a characteristic of the training data and the class should be present in the learned distribution. Any real-valued function of the document and the class can be a feature: . Maximum Entropy enables us to restrict the model distribution to have the same expected value for this feature as seen in the training data, . Thus, we stipulate that the learned conditional distribution (here, c stands for class, and d represents document) must have the below form: ),( cdfi D )|( dcP ( ) ( )⎟⎟⎠ ⎞ ⎜⎜⎝ ⎛= ∑ i ii cdfdZ dcP ,exp1)|( λ (4.1) Where each is a feature, ( cdf i , ) iλ is a parameter which needs to be estimated and ( )dZ is simply the normalizing factor to ensure a proper probability: ( ) ( )∑ ∑= c c ii cdfdZ ,exp λ There are several methods for estimating maximum entropy model from training data such as IIS (improved iterative scaling), GIS, L-BFGS, and so forth. 36 Maximum Entropy for Classification In order to apply maximum entropy, we need to select a set of features. For this work, we use words in documents as our features. More specifically, for each word-class combination, we instantiate a feature as: ⎩⎨ ⎧ == otherwise 0 contains d and ' if 1 ),(', wcc cdf cw Here, c’ is a class, w is a specific word, and d is current document. This feature will check whether “this document d contains the word w and belongs to the class c’ ”. The predicate which states that “this document d contains the word w” is called the “context predicate” of the feature. 4.1.2. Experiments a. Experimental Settings For all the experiments, we based on hidden topics analysis with LDA as described in the previous chapter. We then conduct several experiments: one for learning without hidden topics and the others for learning with different numbers of topic models of the VnExpress dataset which are generated by doing topic analysis for VnExpress dataset with 60, 80, 100, 120, 140 and 160 topics. For learning maximum entropy classifier, we use JMaxent [39] and set the context predicate and feature thresholds to be zero; the other parameters are set at defaults. b. Evaluation Traditional Classification use Precision, Recall and F-1 measure to evaluate the performance of the system. The meanings of such measures are given below: Precision of a classifier with respect to a class is the fraction of the number of examples which are correctly categorized into that class over the number of examples which are classified into that class: c class as classified examples# c class as classifiedcorrectly examples#Precisionc = Recall of a classifier with respect to a class is the fraction of the number of examples which are correctly categorized into that class over the number of examples which belong to that class (by human assignment): c class tobelong examples all# c class as classifiedcorrectly examples#Recallc = 37 To measure the performance of a classifier, it is usually used F-1 measure which is the harmonic mean of precision and recall: cc cc c RecallPrecision RecallPrecision2 1-F + ××= c. Experimental Results and Discussion 62 64 66 68 70 72 74 Wi tho ut To pic s 60 to pic s 80 to pic s 10 0 t op ics 12 0 t op ics 14 0 t op ics 16 0 t op ics Precision Recall F1-measure Figure 4.3. Learning with different topic models of VnExpress dataset; and the baseline (without topics) 62.02 70.86 72.45 72.25 71.91 72.26 65.94 66.41 67.08 66 56 58 60 62 64 66 68 70 72 74 1.3 5 2.7 4.0 5 5.3 52 6.5 52 7.7 52 8.8 59 9.9 09 10 .72 11 .13 Size of labeled training data (x1000 examples) F1 m ea su re (% ) w ith hidden topic inference baseline (w ithout topics) Figure 4.4. Test-out-of train with increasing numbers of training examples. Here, the number of topics is set at 60topics 38 Table 4.2. Experimental results of baseline (learning without topics) Class Human Model Match Pre. Rec. F1-score Business 270 347 203 58.50 75.19 65.80 Culture-Arts 285 260 183 70.38 64.21 67.16 Health 240 275 179 65.09 74.58 69.51 Laws 300 374 246 65.78 82.00 73.00 Politics 270 244 192 78.69 71.11 74.71 Science 259 187 121 64.71 46.72 54.26 Society 240 155 106 68.39 44.17 53.67 Sports 223 230 175 76.09 78.48 77.26 Technologies 270 285 164 57.54 60.74 59.10 Avg.1 67.24 66.35 66.79 Avg.2 2357 2357 1569 66.57 66.57 66.57 Table 4.3. Experimental results of learning with 60 topics of VnExpress dataset Class Human Model Match Pre. Rec. F1-score Business 270 275 197 71.64 72.96 72.29 Culture-Arts 285 340 227 66.76 79.65 72.64 Health 240 256 186 72.66 77.5 75 Laws 300 386 252 65.28 84 73.47 Politics 270 242 206 85.12 76.3 80.47 Science 259 274 177 64.6 68.34 66.42 Society 240 124 97 78.23 40.42 53.3 Sports 223 205 173 84.39 77.58 80.84 Technologies 270 255 180 70.59 66.67 68.57 Avg.1 73.25 71.49 72.36 Avg.2 2357 2357 1695 71.91 71.91 71.91 39 50 55 60 65 70 75 80 85 90 95 100 Bu sin es s Cu ltu re -A rts He alt h La ws Po liti cs Sc ien ce So cie ty Sp or ts Te ch no log ies Av er ag e F1 -M ea su re Without Topics With Hidden Topics Inference Figure 4.5 F1-Measure for classes and average (over all classes) in learning with 60 topics Figure 4.3 shows the results of learning with different settings (without topics, with 60, 80, 100, 120, 140 topics) among which learning with 60 topics got the highest F-1 measure (72.91% in comparison with 66.57% in baseline – see Table 4.2 and Table 4.3). When the number of topics increase, the F-1 measures vary around 70-71% (learning with 100, 120, 140 topics). This shows that learning with hidden topics does improve the performance of classifier no matter how many numbers of topics is chosen. Figure 4.4 depicts the results of learning with 60 topics and different number of training examples. Because the testing dataset and training dataset are relatively exclusive, the performance is not always improved when the training size increases. In any cases, the results for learning with topics are always better than learning without topics. Even with little training dataset (1300 examples), the F-1 measure of learning with topics is quite good (70.68%). Also, the variation of F-1 measure in experiments with topics (2% - from 70 to 72%) is smaller than one without topics (8% - from 62 to 66%). From these observations, we see that our method does take effects even with little learning data. 40 4.2. Clustering with Hidden Topics 4.2.1. Clustering Method Figure 4.6. Clustering with Hidden Topics from VnExpress and Wikipedia data Web search clustering is a solution to reorganize search results in a more convenient way for users. For example, when a user submits query “jaguar” into Google and wants to get search results related to “big cats”, s/he need go to the 10th, 11th, 32nd, and 71st results. If there is a group named “big cats”, the four relevant results can be ranked high in the corresponding list. Among previous works, the noticeable and most successful clustering system is Vivisimo [49] in which the techniques are kept unknown. This section considers deployment issues of clustering web search results with hidden topics in Vietnamese. a. Topic Inference and Similarity For each snippet, after topic inference we get the probability distribution of topics over the snippet. From that topic distribution for each snippet, we construct the topic vector for that snippet as following: the weight of a topic will be assigned zero if its probability less than a predefined ‘cutt-off threshold’, and be assigned the value of its probability otherwise. Suppose that weights for words in the term vector of the snippet have been normalized in some way (tf; tf-idf; etc), the combined vector corresponding to snippet i- th has the following form: { }||121 ,...,,,...,, VKi wwtttd = (4.2) 41 Here, ti is the weight for topic i-th in K analyzed topics (K is a constant parameter of LDA); wi is the weight for word/term i-th in vocabulary V of all snippets. Next, for 2 snippets i-th and j-th, we use the cosine similarity to measure the similarities between topic-parts as well as between term-parts of the 2 vectors. ∑∑ ∏ == = × =− K k kj K k ki K k kjki ji tt tt partstopicsim 1 2 , 1 2 , 1 ,, , )( ∑∑ ∏ == = × =− || 1 2 , || 1 2 , || 1 ,, , )( V t tj V t ti V t tjti ji ww ww partswordsim We later propose the following combination to measure the final similarity between them: ( ) parts)-word(1)partstopic(),( simsimddsim ji ×−+−×= λλ (4.3) Here, λ is a mixture constant. If 0=λ , we calculate the similarity without the support of hidden topics. If 1=λ , we measure the similarity between 2 snippets from hidden topic distributions without concerning words in snippets. b. Agglomerative Hierarchical Clustering Hierarchical clustering [48] builds (agglomerative), or breaks up (divisible), a hierarchy of clusters. The traditional representation of this hierarchy is a tree called dendrogram. Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Cutting the tree at a given height will give a clustering at a selected precision. In the example in Figure 4.7, cutting after the second row will generate clusters {a} {b c} {d e} {f}. Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a coarser clustering, with a smaller number of larger clusters. The method builds the hierarchy from the individual elements by progressively merging clusters. In our example, we have six elements {a} {b} {c} {d} {e} and {f}. The first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, according to the chosen similarity. Optionally, one can construct a similarity matrix at this stage, where the number in the i- th row j-th column is the similarity/distance between the i-th and j-th elements. Then, as 42 clustering progresses, rows and columns are merged as the clusters are merged and the similarities updated. This is a common way to implement this type of clustering, and has the benefit of catching distances between clusters. Figure 4.7. Dendrogram in Agglomerative Hierarchical Clustering Suppose that we have merged the two closest elements b and c, we now have the following clusters {a},{b,c},{d},{e}, and {f}, and want to merge them further. To do that, we need to measure the similarity/distance between {a} and {b c}, or generally similarity between two clusters. Usually the similarity between two clusters A and B can be calculated as one of the following: - The minimum similarity between elements of each cluster (also called complete linkage clustering): ( ){ }ByAxyxsim ∈∈ ,:,min - The maximum similarity between elements of each cluster (also called single linkage clustering): ( ){ }ByAxyxsim ∈∈ ,:,max - The mean similarity between elements of each clusters (also called average linkage clustering): ( )∑∑ ∈ ∈Ax By yxsim BA , |||| 1 Each agglomerative occurs at a smaller similarity between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (similarity criterion) or when there is a sufficiently small number of clusters (number criterion). 43 c. Labeling Clusters Given a set of clusters for a text collection, our goal is to generate understandable semantic labels for each cluster. We now state the problem of cluster labeling similarly to the problem of “topic labeling problem” [27] as follows: Definition 1: A cluster ( ) in a text collection has a set of “close” snippets (here, we consider snippets are small documents), each cluster is characterized by an “expected topic distribution” Cc∈ cθ which is the average of topic distributions of all snippets in the cluster. Definition 2: A “cluster label” or a “label” l for a cluster Cc∈ is a sequence of words which is semantically meaningful and covers the latent meaning of cθ . Words, phrases, and sentences are all valid labels under this definition. Definition 3 (Relevance Score) The relevance score of a label to a cluster cθ , ( )cls θ, , measures the semantic similarity between the label and the topic model. Given that both and are meaningful candidate labels, is a better label for c than if 1l 2l 1l 2l ( ) ( cc lsls )θθ ,, 21 > With these definitions, the problem of cluster labeling can be defined as follows: Let be a set of N clusters, and },...,,{ 21 NcccC = { }siiii lllL ,2,1, ,...,,= be the set of candidate cluster labels for the cluster number i in C. Our goal is to select the most likely label for each cluster. Candidate Label Generation Candidate label is the first phrase to label clusters. In this work, we generate candidates based on “Ngram Testing” which extract meaningful phrases from word n-grams based on statistical tests. There are many methods for testing whether an n-gram is meaningful collocation/phrase or just co-occur by accidence. Some methods depend on statistical measures such as mutual information. Others rely on hypothesis testing techniques. The null hypothesis usually assumes that “the words in an n-gram are independent”, and different test statistics have been proposed to test the significance of violating the null hypothesis. For the experiments, we use the n-gram hypothesis testing (n <=2) which depend on chi- square test [11] to find out meaningful phrases. In other words, there are two types of label candidates: (1) non-stop words (1-gram); and (2) a phrase of 2 consecutive words (2-grams) with its chi-square value calculated from a large collection of text is greater than a threshold - the “colocThreshold”. 44 Table 4.4. Some collocations with highest values of chi-square statistic Collocation (Meaning in Enlish) Chi-square value TP HCM (HCM city) 2.098409912555148E11 Monte Carlo (Monte Carlo) 2.3750623868571806E9 Thuần_phong mỹ_tục (Habits and Customs) 8.404755120045843E8 Bin Laden (Bin Laden) 5.938943787195972E8 Bộ Vi_xử_lý (Center Processing) 3.5782968749839115E8 Thép_miền_nam Cảng_Sài_gòn (a football club) 2.5598593044043452E8 Trận chung_kết (Final Match) 1.939850017618072E8 Đất_khách quê_người (Forein Land) 1.8430912500609657E8 Vạn_lý trường_thành (the Great Wall of China) 1.6699845099865612E8 Đi_tắt đón_đầu (Take a short-cut, Wait in front) 1.0498738800702788E8 Xướng_ca vô_loài 1.0469589600052954E8 Ổ cứng (Hard Disk) 9.693021145946936E7 Sao_mai Điểm_hạn (a music competition) 8.833816801460913E7 Bảng xếp_hạng (Ranking Table) 8.55072554114269E7 Sơ_yếu lý_lịch (Curiculum Vitae) 8.152866670394194E7 Vốn điều_lệ (Charter Capital) 5.578214903954915E7 Xứ_sở sương_mù (England) 4.9596802405895464E7 Windows XP (Windows XP) 4.8020442441390194E7 Thụ_tinh ống_nghiệm (Test-tube Fertilization) 4.750102933435161E7 Outlook Express (Outlook Express) 3.490459668749844E7 Công_nghệ thông_tin (Information Technology) 1587584.1576983468 Hệ_thống thông_tin (Information System) 19716.68246929993 Silicon Valley (Silicon Valley) 1589327.942940336 Relevance Score We borrowed the ideas of simple score and inter-cluster score from [27]. Simple score is the relevance of a label and a specific cluster without concerning the other clusters. Inter- cluster score of a label and a cluster, on the other hand, look at not only the interesting cluster but also other clusters. As a result, the labels chosen using inter-cluster score discriminate clusters better than simple score. In order to get relevance between a label candidate (l) and a cluster (c) using simple score, we use 3 types of features including the topic similarity (topsim) between topic- distribution of the label candidate and the “expected topic distribution” of the cluster, the length of candidate (lenl), number of snippets in the cluster c containing this phrase (cdfl,c). More concretely, given one candidate label lwwwl ...21= , we first inference the 45 topic distribution lθ for the label also by using the estimated model of the universal dataset. Next, the simple relevance score of l is measured with respect to a cluster cθ by using cosine similarity for topic similarity [48]: (4.4) lclclc lencdflsplscore θ ×+×+×α θ βθ= ,|(cosine),( ) γ A good cluster label is not only relevant to the current cluster but also help to distinguish this cluster to another. So, it is very useful to penalize the reference score of a label with respect to the current cluster by the reference scores of that label to other clusters ( and ). Thus, we get the inter-cluster scoring function as follows: c 'c Cc∈' cc ≠' ∑ ≠∈ ×−= ccCc ccc lsplscorelsplscorelscore ',' ' ),(),(),( θμθθ (4.5) The candidate labels of a cluster are sorted - in descending order - by its relevance and the 4 most relevant candidates are then chosen as labels for the cluster. d. Ranking within Cluster We reorganize the order of documents in each cluster by the relevance between its topic distribution and the “expected topic distribution” of the cluster. If the relevant measures of 2 snippets within one cluster are the same, their old ranks in the complete list determined by Google are used to specify the final ranks. In other words, if the 2 snippets have the same relevant score, the one with higher old rank has higher rank in the considered cluster. 4.2.2. Experiments a. Experimental Settings For experiments, we first submit 10 ambiguous queries to Google and get back about 200 search result snippets [Table 4.5]. The reason why we choose these queries is that they are likely to contain multi

Các file đính kèm theo tài liệu này:

MSc08_Nguyen_Cam_Tu_Thesis_English.pdf