Detecting Vietnamese spams using a multi-Objective evolutionary approach - Nguyen Long

Tài liệu Detecting Vietnamese spams using a multi-Objective evolutionary approach - Nguyen Long: Nghiên cứu khoa học công nghệ Tạp chí Nghiên cứu KH&CN quân sự, Số Đặc san CNTT, 12 - 2017 203 DETECTING VIETNAMESE SPAMS USING A MULTI-OBJECTIVE EVOLUTIONARY APPROACH Nguyen Long1*, Nguyen Duc Dinh2,Le Van Diep2, Vu Minh Tuan3, Tran Quang Anh4, Bui Thu Lam5 Abstract: This paper we proposed the usage of a multi-objective algorithm to solve Vietnamese spam detection problems. The problem taking into account the specific Vietnamese, English and Chinese characterises was analyzed, then a multi- objective problem for the Vietnamese spam detection was modelled. With the usage of multi-objectivity, it helps the users more flexibility on system configuration. The proposed multi-objective optimization approach to find feasible trade-off solutions for a anti-spam email system (using Apache SpamAssassin). Two conflicting objectives are raised: The Spam Detection Rate (SDR) and False Alarm Rate (FAR). The experiments were conducted based on multilingual spam data sets ...

pdf14 trang | Chia sẻ: quangot475 | Lượt xem: 446 | Lượt tải: 0download
Bạn đang xem nội dung tài liệu Detecting Vietnamese spams using a multi-Objective evolutionary approach - Nguyen Long, để tải tài liệu về máy bạn click vào nút DOWNLOAD ở trên
Nghiên cứu khoa học công nghệ Tạp chí Nghiên cứu KH&CN quân sự, Số Đặc san CNTT, 12 - 2017 203 DETECTING VIETNAMESE SPAMS USING A MULTI-OBJECTIVE EVOLUTIONARY APPROACH Nguyen Long1*, Nguyen Duc Dinh2,Le Van Diep2, Vu Minh Tuan3, Tran Quang Anh4, Bui Thu Lam5 Abstract: This paper we proposed the usage of a multi-objective algorithm to solve Vietnamese spam detection problems. The problem taking into account the specific Vietnamese, English and Chinese characterises was analyzed, then a multi- objective problem for the Vietnamese spam detection was modelled. With the usage of multi-objectivity, it helps the users more flexibility on system configuration. The proposed multi-objective optimization approach to find feasible trade-off solutions for a anti-spam email system (using Apache SpamAssassin). Two conflicting objectives are raised: The Spam Detection Rate (SDR) and False Alarm Rate (FAR). The experiments were conducted based on multilingual spam data sets (Vietnamese, English and Chinese) through several scenarios with different numbers of SpamAssassin rules. In our experiments, we used two different multi- objective optimization algorithms and a single objective algorithm for the comparison on this problem. According to the statistical results, the new approach not only achieved more efficient results but also created a set of ready-to-use rule scores which supports different levels of the trade-off between SDR and FAR. The experimental results indicate that the proposed approach gives users more flexibility and efficiency in the Anti-spam Mail System. Keywords: Anti-spam, SpamAssassin, MOEAs, SDR, FAR, NSGA-II, DMEA-II. 1. INTRODUCTION In network security area, researchers focused on stopping spammers from annoying senders by suggest a wide range of detecting and anti-spam solutions with different approaches. There are also a number of factors to evaluate the efficiency of solutions, among them SDR and FAR are most popular criteria to measure the effectiveness of a spam detection resolution. The goal of anti-spam approaches is to maximize SDR and to minimize FAR as much as possible. In this case, the higher rate of detecting spam an approach brings the higher probability to alarm a ham (non-spam email) as spam it gets and vice versa. If a spam detection system works effectiveness, it not expected to gain an absolute optimum which are 100% for SDR and 0% for FAR, but it gives an acceptable trade-off between these criteria. In theses anti-spam system, a threshold is a predefined parameter, two objectives SDR ( or FAR) is calculated to evaluate the effectiveness of the system at trial thresholds. The remainder of this paper is organized as follows. A description of Spam Detection System is given in Section II. The common concepts of MOEAs and briefly description about DMEA-II in Section III. Our methodologyis given in section IV, we presented the experiments and discussion in Section V. Finally, the last section concluded the paper and talked about the future of our works. 2. BACKGROUND 2.1. SpamAssassin One of well-known anti-spam systems is SpamAssassin, which is a Bayesian Công nghệ thông tin N. Long, , B. T. Lam, “Detecting Vietnamese spams using a evolutionary approach.” 204 spam filter. The Bayes based system evaluate the probability of a message being spam based upon its contents, then learns from spam and from good mail (ham), resulting in a very robust, adaptive and efficient anti-spam approach. The proposal system is in its early stages of implementation and will only get better over time. The system uses predefined and custom rules as well as a Bayes database for scoring messages within a numerical range. It scores email based upon various characteristics. The characteristics a Bayesian spam filter can look at components or features of an email: the words in the body of the message, other aspects such as HTML code (like colors), word pairs, phrases and meta information (where a particular phrase appears, for example). The recognizing what is considered spam or not considered spam (ham) is key feature of an spam filter system. The SpamAssassin filter can learn to determine a message which is a spam or not when users specifically mark as spam or not spam by sending them to their junk folder in the web client or via mail clients application. SpamAssassin examines the message represented to it and assign a score to indicate the likelihood that the mail. The system working on the predefined set of rules and a score is assigned to a rule. When an email gains enough the score which is greater than the threshold, the email is marked as spam. A built-in module to score its rules is attached in the system, which works as a single objective optimization method. In this case, a fixed threshold is used then the module scores to decrease the error rate over a given training dataset. A stochastic gradient descent algorithm is used to train a single-layer neural network with a transfer function and a logsig activation function. A rule of the system is presented as a node of the neural network. The input of nodes represent whether or not the rule is activated by an email. The score of the rule is the weight of the corresponding node. In recent years, there is an increasing trend in dealing with multi-objectivity in optimizing rule scores [18,11,19,3]. Obviously, there will be several objectives for this problem, typically SDR and FAR. The contribution in this area will be how to designed a MOEA to solve it and how to deal with language-specific email databases. 2.2. Multi-objective evolutionary optimization In many disciplines, optimization problems often have two or more objectives, which are normally in conflict with others, and that we wish to optimize them simultaneously. These problems are called multi-objective optimization problems (MOPs). In fact, MOPs normally give rise not to one, but to a set of solutions (called a Pareto optimal set (POS)) which, in the absence of any further information, are all equally good. An evolutionary algorithm have been very popular for solving MOPs mainly due to their ease of use, work on population and their wide applicability. Evolutionary algorithms allow to find an entire set of Pareto optimal solutions in a single run of the algorithm, instead of having to perform a series of separate runs as in the case of the traditional mathematical programming techniques. Mathematically, in a k-objective unconstrained (bound constrained) minimization Nghiên cứu khoa học công nghệ Tạp chí Nghiên cứu KH&CN quân sự, Số Đặc san CNTT, 12 - 2017 205 problem, a vector function ⃗(⃗) of k objectives is defined as: ⃗(⃗) = [(⃗), (⃗), , (⃗)] (1) in which ⃗ is a vector of decision variables in v-dimensional ℝ. In EC, ⃗represents an individual in the population to be evolved. The value (⃗), then, describes the performance of individual ⃗ as evaluated against the jth objective in the MOP. An individual ⃗ dominates ⃗ if ⃗ is not worse than ⃗on all k objectives and is better than ⃗ on at least one objective. If ⃗does not dominate ⃗ and ⃗ also does not dominate ⃗, then ⃗ and ⃗ are said to be non-dominated with respect to each other. If we use the symbol “≼” to denote that ⃗ ≼ ⃗means ⃗ dominates ⃗, and the symbol “⋫” between two scalars a and b to indicate that a ⋫b means a is not worse than b, then dominance can be formally defined as [7]: Definition 1 (Dominance): ⃗ ≼ ⃗ if the following conditions are held : 1. (⃗) ⋫ (⃗) ∀ ∈ {1,2, , }; and, 2. ∃ ∈ {1,2, , }: (⃗) ⊲ (⃗). In general, if an individual is not dominated by any other individual in the population, it is called a non-dominated solution. All non-dominated solutions in a population form the non-dominated set as formally described in the following definition: Definition 2 (Non-Dominated Set): A set S is said to be the non-dominated set of a population P if the following conditions are met: 1.S ⊆ P; and, 2. ∀⃗ ∈ ∄⃗ ∈ : ⃗ ≼ ⃗ If P represents the entire search space, then S is referred to as the global Pareto optimal set. If P represents only a sub-space, then S is called the local Pareto optimal set. While there can be multiple local Pareto optimal sets, there exists only one global one. Multi-objective evolutionary algorithms (MOEAs) are stochastic techniques being used to find Pareto optimal solutions for MOPs. There are two key problems that MOEAs have to deal with [7]. The first one is how to get as close as possible to the POF. This is challenging because of the stochasticity of the convergence process. The second one is how to keep solutions diverse. A diverse set of solutions will provide decision makers, designers, etc with more choice. However, working on a set of solutions instead of only one, makes the measurement of MOEA convergence more difficult because one individual’s closeness to the POF does not act as a measure for the entire set. Unsurprisingly, then, convergence and diversity are commonly used performance criteria when optimization algorithms are assessed and compared with each other [22]. To date, many MOEAs have been developed (i.e Pareto Archived Evolution Strategy (PAES)[10], Strength Pareto EA 2 (SPEA2) [21], Pareto frontier DE (PDE)[1], NSGA-II [8], Decomposition based Multi-objective Evolutionary Algorithm (MOEA/D) [20] and Multi-Objective Particle Swarm Optimization (MOPSO) [6], MODE-LD+SS [2] and the Direction based Multi-objective Evolutionary Algorithm (DMEA)[4] are typical examples of elitist MOEAs) and Công nghệ thông tin N. Long, , B. T. Lam, “Detecting Vietnamese spams using a evolutionary approach.” 206 they are usually classified into two broad categories: with and without elitism. Elitist approach is a mechanism to preserve the best individuals, once found, during the optimization process. The concept of elitism was established at an early stage of EC (see, for example, [9]); and to date, it has been widely used in EAs. Elitist approach can be realized either by placing one or more of the best parents directly into the nextgeneration of individuals, or by replacing only those parents that are dominated by their offspring [17]. Elitist MOEAs usually (but not necessarily) employ an external set called the archive to store the non-dominated solutions after each generation. DMEA-II is an elitism MOEA introduced in [13]. In DMEA-II, two types of directional information are maintained and used to perturb the parental population prior to offspring production: convergence and spread. The convergence direction (CD) is defined as the direction from a solution to a better one, CD in MOP is a normalized vector that points from dominated to non-dominated solutions. The spread direction (SD) is defined as the direction between two equivalent solutions, SD in MOP is an un-normalized vector that points from one non-dominated solution to another. In DMEA-II, a bundle of rays are used either emitting uniformly from the estimated ideal point into the part of objective space that contains the Pareto Optimal Front (POF) estimate. In evolutionary processes, the rays are used as reference lines to select particular non-dominated solutions from the combined population. One by one, the rays are scanned and the non-dominated solution closest to a given ray is selected and archived. The system of rays is also used for a specific niching to maintain non-dominated solutions for the main population. 3. METHODOLOGY In this section, we will describe our methodology to approach for detecting spams in a Vietnamese email database and it is also extended for multilingual languages. 3.1. Vietnamese spam detection based on SpamAssassin - (VSDSA) The main task of an anti-spam system is detecting spam from a set of emails, an email is checked, which is usually in plain text, the system is asked the email is spam or not. If the email is spam, the system will drop it or alert spam to the recipients. Vietnamese spam refers to the spam written in Vietnamese language. In network security area, Vietnamese spamming is one of the major problems in the Vietnamese Internet society recently. The systems, which uses the content for their detection methods need to be customized for working properly with Vietnamese language. The content based rules is used in SpamAssassin, which can only work with a specific language. There are some set of rules is made for different languages such as Chinese, Thai or Turkey have been carried out [16],[5] and [14]. For Vietnamese language, M.T. Vu [12] extended the statistical rule-based method proposed in [16] to produce Vietnamese rules as a part of multilingual rules for SpamAssassin. To make Vietnamese rules for the system, two phases are executed as follows:  Rule pattern selection: a set of spam-liked patterns is generated from a Nghiên cứu khoa học công nghệ Tạp chí Nghiên cứu KH&CN quân sự, Số Đặc san CNTT, 12 - 2017 207 training dataset including both spam and ham. It produced a set of SpamAssassin rules (without scores) could be produced in format of rules in SpamAssassin.  Rule scoring: A score is assigned for each rule so that the rules best classify spam and ham from the training dataset. The rule pattern selection phase makes the differences between language rules by splitting emails into meaningful words in specific language. Unlike others which can be identified by blank characters, in Vietnamese, a segmentation technique is used to retrieve the meaningful words in the body partof spam and ham in the training dataset. In Vietnamese, words might consist of more than one single word, so there is no clear boundary such as blank spaces between words in Vietnamese. In our experiments, we used the proposal Vietnamese word segmentation program by H.P. Le (2008) [15] to split an Vietnamese email into words. In practice, the users are dealing with emails not only in a single language, but also in the mix with other languages, it raises multilingual problems for spam detection. With single optimization approach, M.T. Vu (2012)[12] proposed a way to use a single objective optimization algorithm for multilingual rules in SpamAssassin. The training dataset was made by Vietnamese, English and Chinese emails in the phase of splitting email’s contents into meaningful words no matter what language used in emails. It does not depend on language in the rule scoring phase. The scoring process is raised as an optimization problem to maximize the performances of the rules over a training dataset. This paper we attempted to utilize a multi-objective approach in the rule scoring phase instead of using single objective approaches in [16] and [12]. 3.2. An evolutionary multi-objective approach to VSDSA In an anti-spam systems, the primary goal is finding out the optimized tradeoff between values of SDR and FAR with different thresholds. When the set of spam detection rules remains unchanged, the values of SDR and FAR become a pair that being the most criterion to be optimized at a specific threshold. With multiple thresholds, the score of rules (be optimized for corresponding threshold) are no longer optimized for the current threshold, so the spam detection and false alarm rate are not optimized anymore. In proposed system, for multi-objective approach, two objectives are considered: SDR must be maximized and FAR must be minimized during optimal process. The steps by steps are described as follows: • Step 1:Initialize The goal of the program is finding out a set of ideal scores called where = (, , ), = (31, 51, 101), ∈ [2,5], ∈ [0,2] At the first, the set of will be generated randomly with a random seed in decision space with ranges. A chromosome presented a value in the set, the first value is generated in ranges of (2, 5) (A practiced parameters for threshold), an email is considered as spam at these values. The other values are generated in ranges of (0, 2), they are presented the score of Công nghệ thông tin N. Long, , B. T. Lam, “Detecting Vietnamese spams using a evolutionary approach.” 208 SpamAssassin rules. In the experimental setup, three case studies are carried out: 30 rules and 1 threshold ( = 31), 50 rules and 1 threshold ( = 51), 100 rules and 1 threshold ( = 101), m is number of . • Step 2: Objective function creation The objective functions are designed on the spam dataset S (Spam data sets): = {, , , } and ham dataset H (Ham data sets): = {ℎ, ℎ, , ℎ} The set of N rules is pre-designed based on the framework in [16]: = {, , , } A mating function is designed to match each rule with some spams of hams: (, ) = 1 __ℎℎ_ 0 ℎ (2) Where ∈ , ∈ {, }. The set of rules with randomly-generated scores (from step 1) is evaluated by SpamAssassin against the dataset and . The score sets bring the best results would be selected as a solution for this multi-objective problem. At a threshold , a function is designed to detect spam is implemented as follows: - Input is an email , output is 1 if is spam, else 0 - Is_spam() { score = 0; for i= 0 to N score += m(r,e)*score_of_r if(score > T) then return 1 else return 0 } • Step 3: Objective function calculation Two objective functions are calculated as follows: = ∑ _( ) (3) = ∑ _( ) (4) Here, he SDR objective of this specific problem is reformulate as (1 - SDR) for all objectives are supposedminimized. • Step 4: MOEA execution After all data input and required parameters are ready, a MOEA is used (DMEA-II or NSGA-II) to run and figure out the best population. Based on that population, the final result would be evaluated and compared. Nghiên cứu khoa học công nghệ Tạp chí Nghiên cứu KH&CN quân sự, Số Đặc san CNTT, 12 - 2017 209 4. EXPERIMENTS The first case study, the proposed system is experimented on two Vietnamese email databases with 272 and 426 emails, respectively. Then, the system is executed on multilingual email database (Vietnamese, Chinese and English) with 286 emails. Two different numbers of rules: 30 and 100 are used in experiments. By the series of case studies, both rules and data scales are carried out. The results will be analyzed on aspects of numerical values of SDR and FAR as well as the multi-objectivity. 4.1. Experimental Parameters For the sake of simplicity, we keep parameters are unchanged DMEA-II and NSGA-II: population size (100), number of generations (1000), number of objective (2), number of real variables (+ 1 – is the number of rules), the range of real variables: ([2,5] for the first value, [0,2] for the others), probability of crossover (0.9), probability of mutation (1). In the experiments, both NSGA-II and DMEA-II were tested 30 times with different random seeds. We did our experiments on a Dell PowerEdge R710 Server L5520 2x2.26GHz Quad Core 24GB RAM, 2x146GB storage running Ubuntu OS v13.04 (at the Software Technology Lab of Le Quy Don Technical University). 4.2. Results and Discussion At the end of experiments for each set of rules, the results were recorded for analyzing. Further results gained from MOEAs were compared to that from the experiments using the single objective optimization algorithm (SOOA) [12]. 4.2.1. Experiments on the database of 272 emails The purpose of using this database is to test the ability of the method to deal with the problem when the database is quite small. Statistical results are visualized in Figure 1 from the experiments with either the small problem size of 30 rules (meaning the chromosome’s size is 31) or the larger one with 100 rules: they are solutions found by the algorithms. Obviously, MOEAs found better solutions than SOOA did. They found not only solutions with zero FAR as did by SOOA, but the ones with higher SDR values than that of SDR. For example, in the case of 30 rules, in term of minimizing FAR (at 0%), the best solution recorded for SDR was around 62% for SDR (with both NSGA-II and DMEA-II) while that result for SOOA (Table 1, 2) is only 40.3%. Among solutions which FAR values are around 10%, SDR values of MOEAs are also much better than SOOA’s. They are {(74.03%, 7.79%); (74.46%, 8.66%); (72.29%, 6.93%)} in comparison to the best point {(67.1%, 9.6%)} of SOOA. Note that in Table 1, we reported all solutions found by SOOA with manually using different thresholds. When the threshold increased, less solutions are classified as spam. It might cause spams not being recognized and hence FAR values reduced. This makes it difficult for the users to decide the solution. MOEAs will provides an automatic way to find the solutions. Công nghệ thông tin N. Long, , B. T. Lam, “Detecting Vietnamese spams using a evolutionary approach.” 210 In terms of multi-objectivity, the trade-off solutions found by MOEAs were widely spread in the objective space as oppose to that of SOOA: its skewness towards SDR. This means a strong multi-objectivity in this problem and the need to address by a multi-objective approach. With this set of trade-off solutions, the users will have more good choices for the system. Regarding the performance of two MOEAs, it seems that DMEA-II provided better the set of solutions; the direction guided approach works better in guiding the search. 4.2.2. Experiments on the database of 426 emails We extended our experiments with a larger set of emails (size of 426). With this large set of emails, we expect that the system will have more information for evaluating solutions. The obtained solutions are visually shown in Figure 2. We also reported the results found by SOOA in Table 3, 4. Again, the result in this case once more confirm our previous findings in the case of 272 emails: Figure 1. The result of experiments using NSGA-II, DMEA-II and SOOA with 30 and 100 rules (the database of 272 emails). Table 1. The numerical results of experiments using SOOA with 30 and 100 rules (the database of 272 emails). Threshold SDR FAR 0.5 67.1% 9.6% 1 67.1% 9.6% 1.5 55.8% 0.8% 2 55.8% 0.8% 2.5 40.3% 0.0% 3 39.8% 0.0% 3.5 8.7% 0.0% 4 6.9% 0.0% 4.5 2.6% 0.0% Nghiên cứu khoa học công nghệ Tạp chí Nghiên cứu KH&CN quân sự, Số Đặc san CNTT, 12 - 2017 211 Table 2. The numerical results of experiments using SOOA with 100 rules (the database of 272 emails). Threshold SDR FAR 0.5 86.1% 15.9% 1 78.4% 12.0% 1.5 72.7% 4.0% 2 62.3% 0.8% 2.5 50.6% 0.0% 3 45.5% 0.0% 3.5 19.0% 0.0% 4 10.0% 0.0% 4.5 7.4% 0.0% Table 3. The numerical result of experiments using SOOA with 30 rules (the database of 426 emails). Threshold SDR FAR 0.5 23.8% 0.8% 1 2.8% 0.2% 1.5 0.6% 0% 2 0.4% 0% 2.5 0% 0% 3 0% 0% 3.5 0% 0% 4 0% 0% 4.5 0% 0% Table 4. The numerical result of experiments using SOOA with 100 rules (the database of 426 emails). Threshold SDR FAR 0.5 24.4% 0.8% 1 5.4% 0.2% 1.5 2.2% 0% 2 2.2% 0% 2.5 0.2% 0% 3 0.2% 0% 3.5 0% 0% 4 0% 0% 4.5 0% 0% Công nghệ thông tin N. Long, , B. T. Lam, “Detecting Vietnamese spams using a evolutionary approach.” 212 Figure 2. The result of experiments using NSGA-II, DMEA-II and SOOA with 30 and 100 rules (the database of 426 emails). • It is clear that the application of multi-objective optimization algorithm to spam detection is reasonable and promising. It can simultaneously offer the users a set of solutions trading-off on SDR ad FAR. With this approach, the users do not need to worry about the threshold, but issuing how is the importance of either SDR or FAR to them? • The illustration also pointed out that the more set of rules the algorithms working on, the better results it achieved. • The set of obtained solutions by DMEA-II was uniformly distributed along the POF and got closer to the POF than set of solutions by NSGA-II. It means, on this real problem, DMEA-II is quite good in keeping balance of convergence and diversity of the population, an important feature of a MOEA. 4.2.3. Experiments on the multilingual email database We did the extra experiment on the SpamAssassin using the multilingual email database (Vietnamese, Chinese and English) with 286 emails. With this multilingual set of emails, we expect that the system will be useful in case of multiple languages. The obtained solutions are visually shown in Figure 3. We also reported the results found by SOOA in Table 5, 6. The result in this case, the proposed approach not only obtained efficient results on single Vietnamese language, it also archived competitive results on multilingual dataset (including Vietnamese, Chinese and English). It indicates that proposed approach gives users more flexibility and efficiency in the Anti-spam Mail System on multiple languages. Obviously, in case of multilingual emails, the result in this case once more confirm our findings: The set of obtained solutions found by MOEAs were widely spread in the objective space as oppose to that of SOOA. It means, the users will have more good choices for the system. Nghiên cứu khoa học công nghệ Tạp chí Nghiên cứu KH&CN quân sự, Số Đặc san CNTT, 12 - 2017 213 Figure 3. The result of experiments using NSGA-II, DMEA-II and SOOA (the database of 286 multilingual emails). Table 5. The numerical result of experiments using SOOA with 30 rules (in the database of 286 multilingual emails). Threshold SDR FAR 0.5 23.8% 0.8% 1 2.8% 0.2% 1.5 0.6% 0% 2 0.4% 0% 2.5 0% 0% 3 0% 0% 3.5 0% 0% 4 0% 0% 4.5 0% 0% Table 6. The numerical result of experiments using SOOA with 100 rules (in the database of 286 multilingual emails). Threshold SDR FAR 0.5 49.7% 2.9% 1 45% 1.4% 1.5 29.6% 0% 2 19.8% 0% 2.5 13.2% 0% 3 9.9% 0% 3.5 9.1% 0% 4 8.6% 0% 4.5 8.6% 0% Once again, the set of obtained solutions by DMEA-II was uniformly distributed along the POF and got closer to the POF than set of solutions by NSGA-II. It more confirm our findins that, DMEA-II has an important feature of a MOEA: It is quite good in keeping balance of convergence and diversity of the population. Công nghệ thông tin N. Long, , B. T. Lam, “Detecting Vietnamese spams using a evolutionary approach.” 214 5. CONCLUSIONS In this paper, we proposed a framework which applied multi-objective optimization algorithms (NSGA-II and DMEA-II) to solve the problem of a SpamAssassin based anti-spam mail system for Vietnamese and multilingual emails. In practise, two popular objectives (spam detection rate and the false alarm rate) are optimized in recent anti-spam approaches. However, the model of using a single objective optimization is known as traditional methods. Instead of optimizing only one objective, with multi-objective optimization, not only one pair of SDR and FAR for each threshold has been worked out but a set of solutions with different tradeoff levels are computed. The set of obtained solutions are feasible set depending on specific email users’ demands. Especially, the score set of obtained feasible solutions are always ready to used without any trainings. The proposed framework is quite competitive with previous approaches, but the proposal remains some issues which need more efforts to resolve in the future: Firstly, it is the problem of runtime, it needs a measurement for evaluating the runtime of the system especially when using bigger dataset. This concern should be investigated seriously with the expanding of experimental datasets. Secondly, the effectiveness of the system might depends on the performance of selected MOEAs. The system should be experimented with more MOEAs for more diverse results. In the future, we will design an interactive MOEAs for this problem to allow Decision Maker (DM) to interact with optimal processing for controlling population to be converged on her/his preferred areas. The obtained solutions help DM to adjust parameters of SpamAssassin system for her/his expected. REFERENCES [1]. H. Abbass, R. Sarker, and C. Newton, “PDE: A pareto-frontier differential evolution approach for multi-objective optimization problems”, in Proceedings of the IEEE Congress on Evolutionary Computation (CEC2001), vol. 2. Piscataway, NJ: IEEE Press, 2001, pp. 971–978. [2]. C. C. C. M.-M. E. Arias Montano, A., “Mode-ld+ss: A novel differential evolution algorithm incorporating local dominance and scalar selection mechanisms for multi-objective optimization”, In: 2010 IEEE Congress on Evol Comp (CEC2010), 2010. [3]. V. Basto-Fernandes, I. Yevseyeva, and J. R. M´endez, “Optimization of anti- spam systems with multiobjective evolutionary algorithms”, Inf. Resour. Manage. J., vol. 26, no. 1, pp. 54–67, Jan. 2000. [4]. L. T. Bui, J. Liu, A. Bender, M. Barlow, S. Wesolkowski, and H. A. Abbass, “Dmea: a direction- based multiobjective evolutionary algorithm”, Memetic Computing, pp. 271–285, 2011. [5]. K. P. C. Na Songkhla, “Statistical rules for thai spam detection”, Proceeding of: Future Networks, pp. 178–184, 2010. [6]. C. A. C. Coello, G. T. Pulido, and M. S. Lechuga, “Handling multiple objectives with particle swarm optimization”, IEEE Trans. one Evolutionary Computation, vol. 8, no. 3, pp. 256–279, 2004. Nghiên cứu khoa học công nghệ Tạp chí Nghiên cứu KH&CN quân sự, Số Đặc san CNTT, 12 - 2017 215 [7]. K. Deb, “Multiobjective Optimization using Evolutionary Algorithms”, New York: John Wiley and Son Ltd, 2001. [8]. K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: NSGA-II”, IEEE Trans. on Evolutionary Computation, vol. 6, no. 2, pp. 182–197, 2002. [9]. K. A. DeJong, “An analysis of the behavior of a class of genetic adaptive systems”, Ph.D. dissertation, University of Michigan, Ann Arbor, 1975. [10]. J. Knowles and D. Corne, “Approximating the nondominated front using the pareto archived evolution strategy”, Evolutionary Computation, vol. 8, no. 2, pp. 149–172, 2000. [11]. A. L´opez-Herrera, E. Herrera-Viedma, and F. Herrera, “A multiobjective evolutionary algorithm for spam e-mail filtering”, in Intelligent System and Knowledge Engineering, 2008. ISKE 2008. 3rd International Conference on, vol. 1. IEEE, 2008, pp. 366–371. [12]. F. J. V. T. M.T. Vu, Q.A. Tran, “Multilingual rules for spam detection,” Proceedings of the 7th International Conference on Broadband and Biomedical Communications (IB2COM 2012), pp. 106–110, 2012. [13]. L. Nguyen, L. T. Bui, and H. A. Abbass, “Dmea-ii: the direction-based multi-objective evolu- tionary algorithm-ii”, Soft Computing, vol. 18, no. 11, pp. 2119–2134, 2014. [14]. L. O¨ zgu¨r, T. Gu¨ngo¨r, and F. S. Gu¨rgen, “Adaptive anti-spam filtering for agglutinative lan- guages: a special case for turkish”, Pattern Recognition Letters, vol. 25, no. 16, pp. 1819–1831, 2004. [15]. H. N. Phuong L.H, A. Roussanaly, and H. T. Vinh, “A hybrid approach to word segmentation of vietnamese texts”, vol. 5196, pp. 240– 249, 2008. [Online]. Available: 423 [16]. X. L. Q.A. Tran, H. Duan, “Real-time statistical rules for spam detection”, Proceedings of the International Journal of Computer Science and Network Security, pp. 178–184, 2006. [17]. R. Storn and K. Price, “Differential evolution - a simple and efficient adaptive scheme for global optimization over continuous spaces”, technical report pp-95-012, ICSI, Tech. Rep., 1995. [18]. I. Yevseyeva, V. Basto-Fernandes, and J. R. M´endez, “Survey on anti- spam single and multi- objective optimization”, in ENTERprise Information Systems. Springer, 2011, pp. 120–129. [19]. I. Yevseyeva, V. Basto-Fernandes, D. Ruano-Ord´as, and J. R. M´endez, “Optimising anti-spam filters with evolutionary algorithms”, Expert Systems with Applications, 2013. [20]. Q. F. Zhang and H. Li., “Moea/d: A multi-objective evolutionary algorithm based on decomposition”, 2007. [21]. E. Zitzler, M. Laumanns, and L. Thiele, “SPEA2: Improving the strength pareto evolutionary algorithm for multiobjective optimization”, in Evolutionary Methods for Design Optimization and Control with Applications to Industrial Problems, K. C. Giannakoglou, D. T. Tsahalis, J. Công nghệ thông tin N. Long, , B. T. Lam, “Detecting Vietnamese spams using a evolutionary approach.” 216 P. eriaux, K. D. Papailiou, and T. Fogarty, Eds. Int. Center for Numerical Methods in Engineering (Cmine), 2001, pp. 95–100. [22]. E. Zitzler, L. Thiele, and K. Deb, “Comparision of multiobjective evolutionary algorithms: Emprical results”, Evolutionary Computation, vol. 8, no. 1, pp. 173–195, 2000. TÓM TẮT PHÁT HIỆN THƯ RÁC TIẾNG VIỆT BẰNG PHƯƠNG PHÁP SỬ DỤNG GIẢI THUẬT TIẾN HÓA TỐI ƯU ĐA MỤC TIÊU Bài báo giới thiệu việc sử dụng giải thuật đa mục tiêu để giải quyết bài toán phát hiện thư rác tiếng Việt. Bái toán được mở rộng và phân tích đặc điểm tiếng Việt, tiếng Anh, tiếng Trung Quốc sai đó sử dụng bài toán tối ưu đa mục tiêu cho việc phát hiện thư rác đươc mô hình hóa. Bằng việc sử dụng tối ưu đa mục tiêu, người sử dụng, quản trị sẽ linh hoạt trong việc cấu hình hệ thống. Phương pháp tiếp cận tối ưu đa mục tiêu để tìm tập giải pháp khả thi cho hệ thống lọc thư rác ( sử dụng hệ thống Apache SpamAssassin). Hai mục tiêu được đưa ra là: Tỷ lệ phát hiện thư rác (Spam Detection Rate – SDR) và tỷ lệ cảnh báo nhầm (False Alarm Rate – FAR). Thí nghiệm được thực hiện dựa trên dữ liệu đa ngôn ngữ (tiếng Việt, tiếng Anh và tiếng Trung Quốc) thông qua một số hoạt cảnh với số lượng luật SpamAssassin khác nhau. Trong thí nghiệm, hai giải thuật tiến hóa tối ưu đa mục tiêu được sử dụng (DMEA-II và NSGA-II) và một giải thuật đơn mục tiêu để so sánh hiệu quả của đề xuất. Theo phân tích kết quả, phương pháp mới không tạo ra hiệu quả tốt hơn mà còn tao ra một tập luật có thể sử dụng với các mức thỏa hiệp khác nhau của SDR và FAR. Kết quả thí nghiệm chỉ ra rằng với việc ứng dụng tiếp cận tối ưu đa mục tiêu với tập đa ngôn ngữ giúp cho người dùng có thể cấu hình hệ thống đảm bảo hoạt động linh hoạt và hiệu quả trong việc phát hiện thư rác cho các tổ chức, cơ quan, doanh nghiệp. Tõ khãa: Thư rác, SpamAssassin, Tối ưu đa mục tiêu, SDR, FAR, NSGA-II, DMEA-II. Nhận bài ngày 16 tháng 8 năm 2017 Hoàn thiện ngày 26 tháng 11 năm 2017 Chấp nhận đăng ngày 28 tháng 11 năm 2017 Address: 1National Defense Academy; 2MITI, Military Academy of Science and Technology; 3Hanoi University; 4Posts and Telecommunications Institute of Technology; 5Le Quy Don Technical University. * Email: longn@mta.edu.vn.

Các file đính kèm theo tài liệu này:

  • pdf21_7427_2151893.pdf
Tài liệu liên quan