November 26, 2019
Students: Rodrigo Martinez
Tutors: <a href=https://www.fing.edu.uy/inco/grupos/gsi/en/team/gustavo-betarte/>Gustavo Betarte</a>, Alvaro Pardo
Despite the efforts of the security community, for example initiatives such as the OWASP Top 10, it is a known fact that web applications are permanently exposed to attacks that exploit their vulnerabilities. Some web application vulnerabilities can only be discovered through a process of trial and error performed by an attacker. Identifying and characterizing a user's behavior with attack detection techniques therefore becomes crucial: these techniques help prevent attackers from successfully identifying or verifying the existence of vulnerabilities in applications, and help minimize the number of false positives (non-malicious activity identified as malicious). A technological alternative for performing real-time attack analysis is the use of a Web Application Firewall (WAF), a system that intercepts and inspects all traffic between the web server and its clients, searching for attacks in the content of the communication. Most WAFs work by using a set of static rules defined to identify attacks.
In this thesis, we analyze the use of machine learning techniques to enhance web application attack detection in MODSECURITY, an open source WAF that has become a de facto standard implementation.
We first propose a characterization of the problem by defining different scenarios, depending on whether application-specific or generic data is available, and on whether valid and/or attack traffic is available for training. We also analyze existing datasets for use in this context, and we have created our own dataset by capturing real traffic to a real-life application.
We finally present two supervised machine learning solutions. The first is a classic discrimination approach between two classes (valid traffic and attacks). The second is a one-class classification solution for a more realistic scenario in which only valid data is available. In the one-class classification approach, it is assumed that one of the classes (in our case, the valid traffic) can be properly modeled using data from the training set, while the other class (in our problem, the attacks) cannot be modeled because of a total or partial lack of training samples. We present results using both approaches and compare them with MODSECURITY configured with the OWASP Core Rule Set out of the box, which is the most widely deployed set of rules.
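As an illustration only of the one-class idea described above, the following Python sketch fits a very simple model on valid requests alone and flags anything that deviates strongly from it. It is not the method used in the thesis: the three features, the `OneClassModel` class, the `margin` parameter, and the sample requests are all hypothetical choices made for this example.

```python
import math
import statistics

# Hypothetical set of characters that tend to be frequent in injection payloads.
SPECIAL = set("'\"<>;()=-/*%")


def features(request):
    """Map a raw request string to a small numeric feature vector."""
    n = max(len(request), 1)
    return [
        math.log1p(len(request)),                   # log-scaled length
        sum(c in SPECIAL for c in request) / n,     # share of special characters
        sum(c.isdigit() for c in request) / n,      # share of digits
    ]


class OneClassModel:
    """Toy one-class model: per-feature mean and standard deviation of valid traffic."""

    def fit(self, valid_requests, margin=1.5):
        rows = [features(r) for r in valid_requests]
        cols = list(zip(*rows))
        self.mean = [statistics.fmean(c) for c in cols]
        # Guard against zero variance so that scoring never divides by zero.
        self.std = [statistics.pstdev(c) or 1e-6 for c in cols]
        # Threshold: largest score seen on valid data, widened by an assumed margin.
        self.threshold = max(self.score(r) for r in valid_requests) * margin
        return self

    def score(self, request):
        """Largest absolute z-score across features; higher means less typical."""
        f = features(request)
        return max(abs(x - m) / s for x, m, s in zip(f, self.mean, self.std))

    def is_anomalous(self, request):
        return self.score(request) > self.threshold


# Made-up valid traffic; no attack samples are needed for training.
VALID = [
    "GET /index.html",
    "GET /products?id=12",
    "GET /products?id=7&page=2",
    "POST /login user=alice",
    "GET /search?q=shoes",
    "GET /cart?item=42",
]
ATTACK = "GET /products?id=1' OR '1'='1' UNION SELECT * FROM users--"

model = OneClassModel().fit(VALID)
print(model.is_anomalous("GET /products?id=33"))
print(model.is_anomalous(ATTACK))
```

The point of the sketch is that the decision boundary is derived from valid traffic only, so no attack samples are required at training time. A realistic detector would need far richer features and a proper one-class learner.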