Tài liệu Detecting Vietnamese spams using a multi-Objective evolutionary approach - Nguyen Long: Nghiên cứu khoa học công nghệ
Tạp chí Nghiên cứu KH&CN quân sự, Số Đặc san CNTT, 12 - 2017 203
DETECTING VIETNAMESE SPAMS USING A MULTI-OBJECTIVE
EVOLUTIONARY APPROACH
Nguyen Long1*, Nguyen Duc Dinh2,Le Van Diep2,
Vu Minh Tuan3, Tran Quang Anh4, Bui Thu Lam5
Abstract: This paper we proposed the usage of a multi-objective algorithm to
solve Vietnamese spam detection problems. The problem taking into account the
specific Vietnamese, English and Chinese characterises was analyzed, then a multi-
objective problem for the Vietnamese spam detection was modelled. With the usage
of multi-objectivity, it helps the users more flexibility on system configuration. The
proposed multi-objective optimization approach to find feasible trade-off solutions
for a anti-spam email system (using Apache SpamAssassin). Two conflicting
objectives are raised: The Spam Detection Rate (SDR) and False Alarm Rate
(FAR). The experiments were conducted based on multilingual spam data sets
...
14 trang |
Chia sẻ: quangot475 | Lượt xem: 419 | Lượt tải: 0
Bạn đang xem nội dung tài liệu Detecting Vietnamese spams using a multi-Objective evolutionary approach - Nguyen Long, để tải tài liệu về máy bạn click vào nút DOWNLOAD ở trên
Nghiên cứu khoa học công nghệ
Tạp chí Nghiên cứu KH&CN quân sự, Số Đặc san CNTT, 12 - 2017 203
DETECTING VIETNAMESE SPAMS USING A MULTI-OBJECTIVE
EVOLUTIONARY APPROACH
Nguyen Long1*, Nguyen Duc Dinh2,Le Van Diep2,
Vu Minh Tuan3, Tran Quang Anh4, Bui Thu Lam5
Abstract: This paper we proposed the usage of a multi-objective algorithm to
solve Vietnamese spam detection problems. The problem taking into account the
specific Vietnamese, English and Chinese characterises was analyzed, then a multi-
objective problem for the Vietnamese spam detection was modelled. With the usage
of multi-objectivity, it helps the users more flexibility on system configuration. The
proposed multi-objective optimization approach to find feasible trade-off solutions
for a anti-spam email system (using Apache SpamAssassin). Two conflicting
objectives are raised: The Spam Detection Rate (SDR) and False Alarm Rate
(FAR). The experiments were conducted based on multilingual spam data sets
(Vietnamese, English and Chinese) through several scenarios with different
numbers of SpamAssassin rules. In our experiments, we used two different multi-
objective optimization algorithms and a single objective algorithm for the
comparison on this problem. According to the statistical results, the new approach
not only achieved more efficient results but also created a set of ready-to-use rule
scores which supports different levels of the trade-off between SDR and FAR. The
experimental results indicate that the proposed approach gives users more
flexibility and efficiency in the Anti-spam Mail System.
Keywords: Anti-spam, SpamAssassin, MOEAs, SDR, FAR, NSGA-II, DMEA-II.
1. INTRODUCTION
In network security area, researchers focused on stopping spammers from
annoying senders by suggest a wide range of detecting and anti-spam solutions
with different approaches. There are also a number of factors to evaluate the
efficiency of solutions, among them SDR and FAR are most popular criteria to
measure the effectiveness of a spam detection resolution. The goal of anti-spam
approaches is to maximize SDR and to minimize FAR as much as possible. In this
case, the higher rate of detecting spam an approach brings the higher probability to
alarm a ham (non-spam email) as spam it gets and vice versa. If a spam detection
system works effectiveness, it not expected to gain an absolute optimum which are
100% for SDR and 0% for FAR, but it gives an acceptable trade-off between these
criteria. In theses anti-spam system, a threshold is a predefined parameter, two
objectives SDR ( or FAR) is calculated to evaluate the effectiveness of the system
at trial thresholds. The remainder of this paper is organized as follows. A
description of Spam Detection System is given in Section II. The common
concepts of MOEAs and briefly description about DMEA-II in Section III. Our
methodologyis given in section IV, we presented the experiments and discussion in
Section V. Finally, the last section concluded the paper and talked about the future
of our works.
2. BACKGROUND
2.1. SpamAssassin
One of well-known anti-spam systems is SpamAssassin, which is a Bayesian
Công nghệ thông tin
N. Long, , B. T. Lam, “Detecting Vietnamese spams using a evolutionary approach.” 204
spam filter. The Bayes based system evaluate the probability of a message being
spam based upon its contents, then learns from spam and from good mail (ham),
resulting in a very robust, adaptive and efficient anti-spam approach. The proposal
system is in its early stages of implementation and will only get better over time.
The system uses predefined and custom rules as well as a Bayes database for
scoring messages within a numerical range. It scores email based upon various
characteristics. The characteristics a Bayesian spam filter can look at components
or features of an email: the words in the body of the message, other aspects such as
HTML code (like colors), word pairs, phrases and meta information (where a
particular phrase appears, for example).
The recognizing what is considered spam or not considered spam (ham) is key
feature of an spam filter system. The SpamAssassin filter can learn to determine a
message which is a spam or not when users specifically mark as spam or not spam
by sending them to their junk folder in the web client or via mail clients
application.
SpamAssassin examines the message represented to it and assign a score to
indicate the likelihood that the mail. The system working on the predefined set of
rules and a score is assigned to a rule. When an email gains enough the score
which is greater than the threshold, the email is marked as spam. A built-in
module to score its rules is attached in the system, which works as a single
objective optimization method. In this case, a fixed threshold is used then the
module scores to decrease the error rate over a given training dataset. A stochastic
gradient descent algorithm is used to train a single-layer neural network with a
transfer function and a logsig activation function. A rule of the system is presented
as a node of the neural network. The input of nodes represent whether or not
the rule is activated by an email. The score of the rule is the weight of the
corresponding node.
In recent years, there is an increasing trend in dealing with multi-objectivity in
optimizing rule scores [18,11,19,3]. Obviously, there will be several objectives for
this problem, typically SDR and FAR. The contribution in this area will be how to
designed a MOEA to solve it and how to deal with language-specific email
databases.
2.2. Multi-objective evolutionary optimization
In many disciplines, optimization problems often have two or more objectives,
which are normally in conflict with others, and that we wish to optimize them
simultaneously. These problems are called multi-objective optimization problems
(MOPs). In fact, MOPs normally give rise not to one, but to a set of solutions
(called a Pareto optimal set (POS)) which, in the absence of any further
information, are all equally good. An evolutionary algorithm have been very
popular for solving MOPs mainly due to their ease of use, work on population and
their wide applicability. Evolutionary algorithms allow to find an entire set of
Pareto optimal solutions in a single run of the algorithm, instead of having to
perform a series of separate runs as in the case of the traditional mathematical
programming techniques.
Mathematically, in a k-objective unconstrained (bound constrained) minimization
Nghiên cứu khoa học công nghệ
Tạp chí Nghiên cứu KH&CN quân sự, Số Đặc san CNTT, 12 - 2017 205
problem, a vector function ⃗( ⃗) of k objectives is defined as:
⃗( ⃗) = [ ( ⃗), ( ⃗), , ( ⃗)] (1)
in which ⃗ is a vector of decision variables in v-dimensional ℝ . In EC,
⃗represents an individual in the population to be evolved. The value ( ⃗), then,
describes the performance of individual ⃗ as evaluated against the jth objective in
the MOP.
An individual ⃗ dominates ⃗ if ⃗ is not worse than ⃗ on all k objectives and
is better than ⃗ on at least one objective. If ⃗ does not dominate ⃗ and ⃗ also
does not dominate ⃗ , then ⃗ and ⃗ are said to be non-dominated with respect to
each other. If we use the symbol “≼” to denote that ⃗ ≼ ⃗ means ⃗ dominates
⃗ , and the symbol “⋫” between two scalars a and b to indicate that a ⋫b means a
is not worse than b, then dominance can be formally defined as [7]:
Definition 1 (Dominance): ⃗ ≼ ⃗ if the following conditions are held :
1. ( ⃗ ) ⋫ ( ⃗ ) ∀ ∈ {1,2, , }; and,
2. ∃ ∈ {1,2, , }: ( ⃗ ) ⊲ ( ⃗ ).
In general, if an individual is not dominated by any other individual in the
population, it is called a non-dominated solution. All non-dominated solutions in a
population form the non-dominated set as formally described in the following
definition:
Definition 2 (Non-Dominated Set): A set S is said to be the non-dominated set
of a population P if the following conditions are met:
1.S ⊆ P; and,
2. ∀ ⃗ ∈ ∄ ⃗ ∈ : ⃗ ≼ ⃗
If P represents the entire search space, then S is referred to as the global Pareto
optimal set. If P represents only a sub-space, then S is called the local Pareto
optimal set. While there can be multiple local Pareto optimal sets, there exists only
one global one.
Multi-objective evolutionary algorithms (MOEAs) are stochastic techniques
being used to find Pareto optimal solutions for MOPs. There are two key problems
that MOEAs have to deal with [7]. The first one is how to get as close as possible
to the POF. This is challenging because of the stochasticity of the convergence
process. The second one is how to keep solutions diverse. A diverse set of
solutions will provide decision makers, designers, etc with more choice. However,
working on a set of solutions instead of only one, makes the measurement of
MOEA convergence more difficult because one individual’s closeness to the POF
does not act as a measure for the entire set. Unsurprisingly, then, convergence and
diversity are commonly used performance criteria when optimization algorithms
are assessed and compared with each other [22].
To date, many MOEAs have been developed (i.e Pareto Archived Evolution
Strategy (PAES)[10], Strength Pareto EA 2 (SPEA2) [21], Pareto frontier DE
(PDE)[1], NSGA-II [8], Decomposition based Multi-objective Evolutionary
Algorithm (MOEA/D) [20] and Multi-Objective Particle Swarm Optimization
(MOPSO) [6], MODE-LD+SS [2] and the Direction based Multi-objective
Evolutionary Algorithm (DMEA)[4] are typical examples of elitist MOEAs) and
Công nghệ thông tin
N. Long, , B. T. Lam, “Detecting Vietnamese spams using a evolutionary approach.” 206
they are usually classified into two broad categories: with and without elitism.
Elitist approach is a mechanism to preserve the best individuals, once found,
during the optimization process. The concept of elitism was established at an early
stage of EC (see, for example, [9]); and to date, it has been widely used in EAs.
Elitist approach can be realized either by placing one or more of the best parents
directly into the nextgeneration of individuals, or by replacing only those parents
that are dominated by their offspring [17]. Elitist MOEAs usually (but not
necessarily) employ an external set called the archive to store the non-dominated
solutions after each generation.
DMEA-II is an elitism MOEA introduced in [13]. In DMEA-II, two types of
directional information are maintained and used to perturb the parental population
prior to offspring production: convergence and spread. The convergence direction
(CD) is defined as the direction from a solution to a better one, CD in MOP is a
normalized vector that points from dominated to non-dominated solutions. The
spread direction (SD) is defined as the direction between two equivalent solutions,
SD in MOP is an un-normalized vector that points from one non-dominated
solution to another.
In DMEA-II, a bundle of rays are used either emitting uniformly from the
estimated ideal point into the part of objective space that contains the Pareto
Optimal Front (POF) estimate. In evolutionary processes, the rays are used as
reference lines to select particular non-dominated solutions from the combined
population. One by one, the rays are scanned and the non-dominated solution
closest to a given ray is selected and archived. The system of rays is also used for a
specific niching to maintain non-dominated solutions for the main population.
3. METHODOLOGY
In this section, we will describe our methodology to approach for detecting
spams in a Vietnamese email database and it is also extended for multilingual
languages.
3.1. Vietnamese spam detection based on SpamAssassin - (VSDSA)
The main task of an anti-spam system is detecting spam from a set of emails, an
email is checked, which is usually in plain text, the system is asked the email is
spam or not. If the email is spam, the system will drop it or alert spam to the
recipients. Vietnamese spam refers to the spam written in Vietnamese language. In
network security area, Vietnamese spamming is one of the major problems in the
Vietnamese Internet society recently. The systems, which uses the content for their
detection methods need to be customized for working properly with Vietnamese
language. The content based rules is used in SpamAssassin, which can only work
with a specific language. There are some set of rules is made for different
languages such as Chinese, Thai or Turkey have been carried out [16],[5] and
[14]. For Vietnamese language, M.T. Vu [12] extended the statistical rule-based
method proposed in [16] to produce Vietnamese rules as a part of multilingual
rules for SpamAssassin. To make Vietnamese rules for the system, two phases are
executed as follows:
Rule pattern selection: a set of spam-liked patterns is generated from a
Nghiên cứu khoa học công nghệ
Tạp chí Nghiên cứu KH&CN quân sự, Số Đặc san CNTT, 12 - 2017 207
training dataset including both spam and ham. It produced a set of
SpamAssassin rules (without scores) could be produced in format of rules in
SpamAssassin.
Rule scoring: A score is assigned for each rule so that the rules best classify
spam and ham from the training dataset.
The rule pattern selection phase makes the differences between language rules
by splitting emails into meaningful words in specific language. Unlike others
which can be identified by blank characters, in Vietnamese, a segmentation
technique is used to retrieve the meaningful words in the body partof spam and
ham in the training dataset. In Vietnamese, words might consist of more than one
single word, so there is no clear boundary such as blank spaces between words in
Vietnamese. In our experiments, we used the proposal Vietnamese word
segmentation program by H.P. Le (2008) [15] to split an Vietnamese email into
words.
In practice, the users are dealing with emails not only in a single language, but
also in the mix with other languages, it raises multilingual problems for spam
detection. With single optimization approach, M.T. Vu (2012)[12] proposed a way
to use a single objective optimization algorithm for multilingual rules in
SpamAssassin. The training dataset was made by Vietnamese, English and
Chinese emails in the phase of splitting email’s contents into meaningful words no
matter what language used in emails. It does not depend on language in the rule
scoring phase. The scoring process is raised as an optimization problem to
maximize the performances of the rules over a training dataset. This paper we
attempted to utilize a multi-objective approach in the rule scoring phase instead of
using single objective approaches in [16] and [12].
3.2. An evolutionary multi-objective approach to VSDSA
In an anti-spam systems, the primary goal is finding out the optimized tradeoff
between values of SDR and FAR with different thresholds. When the set of spam
detection rules remains unchanged, the values of SDR and FAR become a pair that
being the most criterion to be optimized at a specific threshold. With multiple
thresholds, the score of rules (be optimized for corresponding threshold) are no
longer optimized for the current threshold, so the spam detection and false alarm
rate are not optimized anymore.
In proposed system, for multi-objective approach, two objectives are
considered: SDR must be maximized and FAR must be minimized during optimal
process. The steps by steps are described as follows:
• Step 1:Initialize
The goal of the program is finding out a set of ideal scores called where
= ( , , ), = (31, 51, 101), ∈ [2,5], ∈ [0,2]
At the first, the set of will be generated randomly with a random seed in
decision space with ranges. A chromosome presented a value in the set, the
first value is generated in ranges of (2, 5) (A practiced parameters for
threshold), an email is considered as spam at these values. The other values
are generated in ranges of (0, 2), they are presented the score of
Công nghệ thông tin
N. Long, , B. T. Lam, “Detecting Vietnamese spams using a evolutionary approach.” 208
SpamAssassin rules. In the experimental setup, three case studies are
carried out: 30 rules and 1 threshold ( = 31), 50 rules and 1 threshold (
= 51), 100 rules and 1 threshold ( = 101), m is number of .
• Step 2: Objective function creation
The objective functions are designed on the spam dataset S (Spam data
sets):
= { , , , }
and ham dataset H (Ham data sets):
= {ℎ , ℎ , , ℎ }
The set of N rules is pre-designed based on the framework in [16]:
= { , , , }
A mating function is designed to match each rule with some spams of
hams:
( , ) =
1 _ _ ℎ ℎ _
0 ℎ
(2)
Where ∈ , ∈ { , }. The set of rules with randomly-generated scores
(from step 1) is evaluated by SpamAssassin against the dataset and .
The score sets bring the best results would be selected as a solution for this
multi-objective problem. At a threshold , a function is designed to detect
spam is implemented as follows:
- Input is an email , output is 1 if is spam, else 0
- Is_spam( )
{
score = 0;
for i= 0 to N
score += m(r,e)*score_of_r
if(score > T)
then return 1
else return 0
}
• Step 3: Objective function calculation
Two objective functions are calculated as follows:
=
∑ _ (
)
(3)
=
∑ _ (
)
(4)
Here, he SDR objective of this specific problem is reformulate as (1
- SDR) for all objectives are supposedminimized.
• Step 4: MOEA execution
After all data input and required parameters are ready, a MOEA is used
(DMEA-II or NSGA-II) to run and figure out the best population. Based on
that population, the final result would be evaluated and compared.
Nghiên cứu khoa học công nghệ
Tạp chí Nghiên cứu KH&CN quân sự, Số Đặc san CNTT, 12 - 2017 209
4. EXPERIMENTS
The first case study, the proposed system is experimented on two Vietnamese
email databases with 272 and 426 emails, respectively. Then, the system is
executed on multilingual email database (Vietnamese, Chinese and English) with
286 emails. Two different numbers of rules: 30 and 100 are used in experiments.
By the series of case studies, both rules and data scales are carried out. The results
will be analyzed on aspects of numerical values of SDR and FAR as well as the
multi-objectivity.
4.1. Experimental Parameters
For the sake of simplicity, we keep parameters are unchanged DMEA-II and
NSGA-II: population size (100), number of generations (1000), number of
objective (2), number of real variables ( + 1 – is the number of rules), the range
of real variables: ([2,5] for the first value, [0,2] for the others), probability of
crossover (0.9), probability of mutation (1).
In the experiments, both NSGA-II and DMEA-II were tested 30 times with
different random seeds. We did our experiments on a Dell PowerEdge R710 Server
L5520 2x2.26GHz Quad Core 24GB RAM, 2x146GB storage running Ubuntu OS
v13.04 (at the Software Technology Lab of Le Quy Don Technical University).
4.2. Results and Discussion
At the end of experiments for each set of rules, the results were recorded for
analyzing. Further results gained from MOEAs were compared to that from the
experiments using the single objective optimization algorithm (SOOA) [12].
4.2.1. Experiments on the database of 272 emails
The purpose of using this database is to test the ability of the method to deal with
the problem when the database is quite small. Statistical results are visualized in
Figure 1 from the experiments with either the small problem size of 30 rules
(meaning the chromosome’s size is 31) or the larger one with 100 rules: they are
solutions found by the algorithms. Obviously, MOEAs found better solutions than
SOOA did. They found not only solutions with zero FAR as did by SOOA, but the
ones with higher SDR values than that of SDR. For example, in the case of 30 rules,
in term of minimizing FAR (at 0%), the best solution recorded for SDR was around
62% for SDR (with both NSGA-II and DMEA-II) while that result for SOOA
(Table 1, 2) is only 40.3%. Among solutions which FAR values are around
10%, SDR values of MOEAs are also much better than SOOA’s. They are
{(74.03%, 7.79%); (74.46%, 8.66%); (72.29%, 6.93%)} in comparison to
the best point {(67.1%, 9.6%)} of SOOA. Note that in Table 1, we reported all
solutions found by SOOA with manually using different thresholds. When the
threshold increased, less solutions are classified as spam. It might cause spams not
being recognized and hence FAR values reduced. This makes it difficult for the users
to decide the solution. MOEAs will provides an automatic way to find the solutions.
Công nghệ thông tin
N. Long, , B. T. Lam, “Detecting Vietnamese spams using a evolutionary approach.” 210
In terms of multi-objectivity, the trade-off solutions found by MOEAs were widely
spread in the objective space as oppose to that of SOOA: its skewness towards SDR.
This means a strong multi-objectivity in this problem and the need to address by a
multi-objective approach. With this set of trade-off solutions, the users will have
more good choices for the system. Regarding the performance of two MOEAs, it
seems that DMEA-II provided better the set of solutions; the direction guided
approach works better in guiding the search.
4.2.2. Experiments on the database of 426 emails
We extended our experiments with a larger set of emails (size of 426). With this
large set of emails, we expect that the system will have more information for
evaluating solutions. The obtained solutions are visually shown in Figure 2. We
also reported the results found by SOOA in Table 3, 4. Again, the result in this
case once more confirm our previous findings in the case of 272 emails:
Figure 1. The result of experiments using NSGA-II, DMEA-II and SOOA with 30
and 100 rules (the database of 272 emails).
Table 1. The numerical results of experiments using SOOA with 30 and 100 rules
(the database of 272 emails).
Threshold SDR FAR
0.5 67.1% 9.6%
1 67.1% 9.6%
1.5 55.8% 0.8%
2 55.8% 0.8%
2.5 40.3% 0.0%
3 39.8% 0.0%
3.5 8.7% 0.0%
4 6.9% 0.0%
4.5 2.6% 0.0%
Nghiên cứu khoa học công nghệ
Tạp chí Nghiên cứu KH&CN quân sự, Số Đặc san CNTT, 12 - 2017 211
Table 2. The numerical results of experiments using SOOA with 100 rules
(the database of 272 emails).
Threshold SDR FAR
0.5 86.1% 15.9%
1 78.4% 12.0%
1.5 72.7% 4.0%
2 62.3% 0.8%
2.5 50.6% 0.0%
3 45.5% 0.0%
3.5 19.0% 0.0%
4 10.0% 0.0%
4.5 7.4% 0.0%
Table 3. The numerical result of experiments using SOOA with 30 rules
(the database of 426 emails).
Threshold SDR FAR
0.5 23.8% 0.8%
1 2.8% 0.2%
1.5 0.6% 0%
2 0.4% 0%
2.5 0% 0%
3 0% 0%
3.5 0% 0%
4 0% 0%
4.5 0% 0%
Table 4. The numerical result of experiments using SOOA with 100 rules
(the database of 426 emails).
Threshold SDR FAR
0.5 24.4% 0.8%
1 5.4% 0.2%
1.5 2.2% 0%
2 2.2% 0%
2.5 0.2% 0%
3 0.2% 0%
3.5 0% 0%
4 0% 0%
4.5 0% 0%
Công nghệ thông tin
N. Long, , B. T. Lam, “Detecting Vietnamese spams using a evolutionary approach.” 212
Figure 2. The result of experiments using NSGA-II, DMEA-II and SOOA with 30
and 100 rules (the database of 426 emails).
• It is clear that the application of multi-objective optimization algorithm to
spam detection is reasonable and promising. It can simultaneously offer the
users a set of solutions trading-off on SDR ad FAR. With this approach, the
users do not need to worry about the threshold, but issuing how is the
importance of either SDR or FAR to them?
• The illustration also pointed out that the more set of rules the algorithms
working on, the better results it achieved.
• The set of obtained solutions by DMEA-II was uniformly distributed along
the POF and got closer to the POF than set of solutions by NSGA-II. It
means, on this real problem, DMEA-II is quite good in keeping balance of
convergence and diversity of the population, an important feature of a
MOEA.
4.2.3. Experiments on the multilingual email database
We did the extra experiment on the SpamAssassin using the multilingual email
database (Vietnamese, Chinese and English) with 286 emails. With this
multilingual set of emails, we expect that the system will be useful in case of
multiple languages. The obtained solutions are visually shown in Figure 3. We also
reported the results found by SOOA in Table 5, 6. The result in this case, the
proposed approach not only obtained efficient results on single Vietnamese
language, it also archived competitive results on multilingual dataset (including
Vietnamese, Chinese and English). It indicates that proposed approach gives users
more flexibility and efficiency in the Anti-spam Mail System on multiple
languages. Obviously, in case of multilingual emails, the result in this case once
more confirm our findings:
The set of obtained solutions found by MOEAs were widely spread in the
objective space as oppose to that of SOOA. It means, the users will have more
good choices for the system.
Nghiên cứu khoa học công nghệ
Tạp chí Nghiên cứu KH&CN quân sự, Số Đặc san CNTT, 12 - 2017 213
Figure 3. The result of experiments using NSGA-II, DMEA-II and SOOA
(the database of 286 multilingual emails).
Table 5. The numerical result of experiments using SOOA with 30 rules
(in the database of 286 multilingual emails).
Threshold SDR FAR
0.5 23.8% 0.8%
1 2.8% 0.2%
1.5 0.6% 0%
2 0.4% 0%
2.5 0% 0%
3 0% 0%
3.5 0% 0%
4 0% 0%
4.5 0% 0%
Table 6. The numerical result of experiments using SOOA with 100 rules
(in the database of 286 multilingual emails).
Threshold SDR FAR
0.5 49.7% 2.9%
1 45% 1.4%
1.5 29.6% 0%
2 19.8% 0%
2.5 13.2% 0%
3 9.9% 0%
3.5 9.1% 0%
4 8.6% 0%
4.5 8.6% 0%
Once again, the set of obtained solutions by DMEA-II was uniformly
distributed along the POF and got closer to the POF than set of solutions by
NSGA-II. It more confirm our findins that, DMEA-II has an important feature of a
MOEA: It is quite good in keeping balance of convergence and diversity of the
population.
Công nghệ thông tin
N. Long, , B. T. Lam, “Detecting Vietnamese spams using a evolutionary approach.” 214
5. CONCLUSIONS
In this paper, we proposed a framework which applied multi-objective
optimization algorithms (NSGA-II and DMEA-II) to solve the problem of a
SpamAssassin based anti-spam mail system for Vietnamese and multilingual
emails. In practise, two popular objectives (spam detection rate and the false alarm
rate) are optimized in recent anti-spam approaches. However, the model of using a
single objective optimization is known as traditional methods. Instead of
optimizing only one objective, with multi-objective optimization, not only one pair
of SDR and FAR for each threshold has been worked out but a set of solutions
with different tradeoff levels are computed. The set of obtained solutions are
feasible set depending on specific email users’ demands. Especially, the score set
of obtained feasible solutions are always ready to used without any trainings.
The proposed framework is quite competitive with previous approaches, but the
proposal remains some issues which need more efforts to resolve in the future:
Firstly, it is the problem of runtime, it needs a measurement for evaluating the
runtime of the system especially when using bigger dataset. This concern should
be investigated seriously with the expanding of experimental datasets. Secondly,
the effectiveness of the system might depends on the performance of selected
MOEAs. The system should be experimented with more MOEAs for more diverse
results.
In the future, we will design an interactive MOEAs for this problem to allow
Decision Maker (DM) to interact with optimal processing for controlling
population to be converged on her/his preferred areas. The obtained solutions help
DM to adjust parameters of SpamAssassin system for her/his expected.
REFERENCES
[1]. H. Abbass, R. Sarker, and C. Newton, “PDE: A pareto-frontier differential
evolution approach for multi-objective optimization problems”, in
Proceedings of the IEEE Congress on Evolutionary Computation
(CEC2001), vol. 2. Piscataway, NJ: IEEE Press, 2001, pp. 971–978.
[2]. C. C. C. M.-M. E. Arias Montano, A., “Mode-ld+ss: A novel differential
evolution algorithm incorporating local dominance and scalar selection
mechanisms for multi-objective optimization”, In: 2010 IEEE Congress on
Evol Comp (CEC2010), 2010.
[3]. V. Basto-Fernandes, I. Yevseyeva, and J. R. M´endez, “Optimization of anti-
spam systems with multiobjective evolutionary algorithms”, Inf. Resour.
Manage. J., vol. 26, no. 1, pp. 54–67, Jan. 2000.
[4]. L. T. Bui, J. Liu, A. Bender, M. Barlow, S. Wesolkowski, and H. A. Abbass,
“Dmea: a direction- based multiobjective evolutionary algorithm”, Memetic
Computing, pp. 271–285, 2011.
[5]. K. P. C. Na Songkhla, “Statistical rules for thai spam detection”, Proceeding
of: Future Networks, pp. 178–184, 2010.
[6]. C. A. C. Coello, G. T. Pulido, and M. S. Lechuga, “Handling multiple
objectives with particle swarm optimization”, IEEE Trans. one Evolutionary
Computation, vol. 8, no. 3, pp. 256–279, 2004.
Nghiên cứu khoa học công nghệ
Tạp chí Nghiên cứu KH&CN quân sự, Số Đặc san CNTT, 12 - 2017 215
[7]. K. Deb, “Multiobjective Optimization using Evolutionary Algorithms”, New
York: John Wiley and Son Ltd, 2001.
[8]. K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist
multiobjective genetic algorithm: NSGA-II”, IEEE Trans. on Evolutionary
Computation, vol. 6, no. 2, pp. 182–197, 2002.
[9]. K. A. DeJong, “An analysis of the behavior of a class of genetic adaptive
systems”, Ph.D. dissertation, University of Michigan, Ann Arbor, 1975.
[10]. J. Knowles and D. Corne, “Approximating the nondominated front using the
pareto archived evolution strategy”, Evolutionary Computation, vol. 8, no.
2, pp. 149–172, 2000.
[11]. A. L´opez-Herrera, E. Herrera-Viedma, and F. Herrera, “A multiobjective
evolutionary algorithm for spam e-mail filtering”, in Intelligent System and
Knowledge Engineering, 2008. ISKE 2008. 3rd International Conference on,
vol. 1. IEEE, 2008, pp. 366–371.
[12]. F. J. V. T. M.T. Vu, Q.A. Tran, “Multilingual rules for spam detection,”
Proceedings of the 7th International Conference on Broadband and
Biomedical Communications (IB2COM 2012), pp. 106–110, 2012.
[13]. L. Nguyen, L. T. Bui, and H. A. Abbass, “Dmea-ii: the direction-based
multi-objective evolu- tionary algorithm-ii”, Soft Computing, vol. 18, no. 11,
pp. 2119–2134, 2014.
[14]. L. O¨ zgu¨r, T. Gu¨ngo¨r, and F. S. Gu¨rgen, “Adaptive anti-spam
filtering for agglutinative lan- guages: a special case for turkish”, Pattern
Recognition Letters, vol. 25, no. 16, pp. 1819–1831, 2004.
[15]. H. N. Phuong L.H, A. Roussanaly, and H. T. Vinh, “A hybrid
approach to word segmentation of vietnamese texts”, vol. 5196, pp. 240–
249, 2008. [Online]. Available:
423
[16]. X. L. Q.A. Tran, H. Duan, “Real-time statistical rules for spam detection”,
Proceedings of the International Journal of Computer Science and Network
Security, pp. 178–184, 2006.
[17]. R. Storn and K. Price, “Differential evolution - a simple and efficient
adaptive scheme for global optimization over continuous spaces”, technical
report pp-95-012, ICSI, Tech. Rep., 1995.
[18]. I. Yevseyeva, V. Basto-Fernandes, and J. R. M´endez, “Survey on anti-
spam single and multi- objective optimization”, in ENTERprise
Information Systems. Springer, 2011, pp. 120–129.
[19]. I. Yevseyeva, V. Basto-Fernandes, D. Ruano-Ord´as, and J. R. M´endez,
“Optimising anti-spam filters with evolutionary algorithms”, Expert Systems
with Applications, 2013.
[20]. Q. F. Zhang and H. Li., “Moea/d: A multi-objective evolutionary algorithm
based on decomposition”, 2007.
[21]. E. Zitzler, M. Laumanns, and L. Thiele, “SPEA2: Improving the strength
pareto evolutionary algorithm for multiobjective optimization”, in
Evolutionary Methods for Design Optimization and Control with
Applications to Industrial Problems, K. C. Giannakoglou, D. T. Tsahalis, J.
Công nghệ thông tin
N. Long, , B. T. Lam, “Detecting Vietnamese spams using a evolutionary approach.” 216
P. eriaux, K. D. Papailiou, and T. Fogarty, Eds. Int. Center for Numerical
Methods in Engineering (Cmine), 2001, pp. 95–100.
[22]. E. Zitzler, L. Thiele, and K. Deb, “Comparision of multiobjective
evolutionary algorithms: Emprical results”, Evolutionary Computation, vol.
8, no. 1, pp. 173–195, 2000.
TÓM TẮT
PHÁT HIỆN THƯ RÁC TIẾNG VIỆT BẰNG PHƯƠNG PHÁP SỬ DỤNG GIẢI
THUẬT TIẾN HÓA TỐI ƯU ĐA MỤC TIÊU
Bài báo giới thiệu việc sử dụng giải thuật đa mục tiêu để giải quyết bài
toán phát hiện thư rác tiếng Việt. Bái toán được mở rộng và phân tích đặc
điểm tiếng Việt, tiếng Anh, tiếng Trung Quốc sai đó sử dụng bài toán tối ưu
đa mục tiêu cho việc phát hiện thư rác đươc mô hình hóa. Bằng việc sử dụng
tối ưu đa mục tiêu, người sử dụng, quản trị sẽ linh hoạt trong việc cấu hình
hệ thống. Phương pháp tiếp cận tối ưu đa mục tiêu để tìm tập giải pháp khả
thi cho hệ thống lọc thư rác ( sử dụng hệ thống Apache SpamAssassin). Hai
mục tiêu được đưa ra là: Tỷ lệ phát hiện thư rác (Spam Detection Rate –
SDR) và tỷ lệ cảnh báo nhầm (False Alarm Rate – FAR). Thí nghiệm được
thực hiện dựa trên dữ liệu đa ngôn ngữ (tiếng Việt, tiếng Anh và tiếng Trung
Quốc) thông qua một số hoạt cảnh với số lượng luật SpamAssassin khác
nhau. Trong thí nghiệm, hai giải thuật tiến hóa tối ưu đa mục tiêu được sử
dụng (DMEA-II và NSGA-II) và một giải thuật đơn mục tiêu để so sánh hiệu
quả của đề xuất. Theo phân tích kết quả, phương pháp mới không tạo ra hiệu
quả tốt hơn mà còn tao ra một tập luật có thể sử dụng với các mức thỏa hiệp
khác nhau của SDR và FAR. Kết quả thí nghiệm chỉ ra rằng với việc ứng
dụng tiếp cận tối ưu đa mục tiêu với tập đa ngôn ngữ giúp cho người dùng
có thể cấu hình hệ thống đảm bảo hoạt động linh hoạt và hiệu quả trong việc
phát hiện thư rác cho các tổ chức, cơ quan, doanh nghiệp.
Tõ khãa: Thư rác, SpamAssassin, Tối ưu đa mục tiêu, SDR, FAR, NSGA-II, DMEA-II.
Nhận bài ngày 16 tháng 8 năm 2017
Hoàn thiện ngày 26 tháng 11 năm 2017
Chấp nhận đăng ngày 28 tháng 11 năm 2017
Address: 1National Defense Academy;
2MITI, Military Academy of Science and Technology;
3Hanoi University;
4Posts and Telecommunications Institute of Technology;
5Le Quy Don Technical University.
* Email: longn@mta.edu.vn.
Các file đính kèm theo tài liệu này:
- 21_7427_2151893.pdf