Playing with Parsers 1 - Building a Simple Lexer


2015. 7. 8. 08:44

The following is a configuration format loosely modeled on part of the Apache web server configuration.

allow 172.20.0.128/25

allow 172.30.0.1-255

allow 172.30.1.1

deny 10.10.10.10

order deny, allow

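I made up this configuration to specify the IP ranges that are allowed or denied access. In this post I want to practice building a parser that reads this configuration, parses it, and turns it into an actual model.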

The BNF for this configuration could be written roughly as follows. (BNF notation differs from one book or document to the next, so I get confused every time I write it.)

config : allowOrDenyDeclList orderDecl;

allowOrDenyDeclList : allowOrDenyDecl*;

allowOrDenyDecl : allowDecl | denyDecl;

allowDecl : ALLOW iprange;

denyDecl : DENY iprange;

orderDecl : ORDER (allowDeny | denyAllow);

ORDER : 'order';

allowDeny : ALLOW ',' DENY;

denyAllow : DENY ',' ALLOW;

ALLOW : 'allow';

DENY : 'deny';

iprange : DIGIT+ '.' DIGIT+ '.' DIGIT+ '.' DIGIT+ ( ('/' | '-') DIGIT+)?;

DIGIT : '0'..'9';

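It would be nice to go straight to a parser that implements this grammar, but one thing has to come first: building a lexer. Implementing a lexer or parser for a general-purpose programming language would be complex, but the lexer here covers a very limited domain, so I will keep the implementation very simple.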

Building a simple lexer

Let's build a very simple lexer using regular expressions. The lexer reads the configuration string and converts it into tokens. For example, consider the following configuration.

allow 1.2.3.4

The lexer should produce the following tokens from this string.

ALLOW(value='allow') WS(value=' ') IPRANGE(value='1.2.3.4')

The parser then parses this token stream to produce the final result.

The lexer produces the following kinds of tokens.

ALLOW : the 'allow' keyword
DENY : the 'deny' keyword
IPRANGE : an IP range value
ORDER : the 'order' keyword
COMMA : a comma (,)
WS : whitespace characters (' \t\r\n')

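Token types have the following characteristics.

The token values that belong to each token type can be described with a regular expression. For example, the 'allow' keyword can be written as "^allow" and whitespace as "^(\s)+". (The quantifier has to be + rather than *, because a pattern that can match the empty string would never consume any input.)
Whitespace tokens do not need to be passed to the parser.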
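
To see why the patterns start with "^", here is a minimal sketch of how Matcher.find() behaves with and without the anchor. The class name AnchoredMatchDemo and the helper matchPrefix are hypothetical names used only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchoredMatchDemo {

    // Returns the first text that find() locates for the regex, or null if nothing matches.
    static String matchPrefix(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // With '^', the keyword is only accepted at the very start of the input.
        System.out.println(matchPrefix("^allow", "allow 1.2.3.4")); // allow
        System.out.println(matchPrefix("^allow", "deny allow"));    // null

        // Without '^', find() scans forward and matches text in the middle of the input.
        System.out.println(matchPrefix("allow", "deny allow"));     // allow

        // Whitespace with '+': consumes the leading run of blanks and tabs.
        System.out.println(matchPrefix("^\\s+", "  \tallow").length()); // 3
    }
}
```

Because find() searches the whole input, a pattern without the anchor would let a lexer skip over text it never recognized.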

The enum type TokenType holds both pieces of information: the regular expression used to split off a token, and whether the token is included in the output.

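public enum TokenType {
    TT_ALLOW("^allow", true),
    TT_DENY("^deny", true),
    TT_ORDER("^order", true),
    // Every octet needs "+" so that multi-digit values such as 128 match in full.
    TT_IPRANGE("^[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+((/|-)[0-9]+)?", true),
    // Anchored with "^" like the others, so a comma further along in the input is never matched early.
    TT_COMMA("^,", true),
    // Whitespace is recognized but not passed on to the parser.
    TT_WS("^[ \t\r\n]+", false),
    // EOF has no pattern; the lexer appends it once the input is consumed.
    TT_EOF(null, true);

    private final String regex;
    private final boolean outputIncluded;

    TokenType(String regex, boolean outputIncluded) {
        this.regex = regex;
        this.outputIncluded = outputIncluded;
    }

    public String getRegex() {
        return regex;
    }

    public boolean isOutputIncluded() {
        return outputIncluded;
    }

    public boolean hasRegex() {
        return regex != null && !regex.isEmpty();
    }
}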
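
A quick way to sanity-check an IP range pattern is to run it against the sample settings from the top of the post. The sketch below assumes a pattern that uses [0-9]+ for every octet; the class name IpRangePatternCheck and the helper leadingIpRange are hypothetical. Note that a pattern like this only checks the shape of the text, not whether each number is a valid octet.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IpRangePatternCheck {

    // Assumed pattern: four dot-separated digit groups, optionally followed by
    // "/digits" (prefix length) or "-digits" (end of a last-octet range).
    static final Pattern IPRANGE =
            Pattern.compile("^[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+((/|-)[0-9]+)?");

    // Returns the IP range text at the start of the input, or null when there is none.
    static String leadingIpRange(String input) {
        Matcher matcher = IPRANGE.matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(leadingIpRange("172.20.0.128/25")); // 172.20.0.128/25
        System.out.println(leadingIpRange("172.30.0.1-255"));  // 172.30.0.1-255
        System.out.println(leadingIpRange("1.2.3.4/"));        // 1.2.3.4 (the '/' is left over)
        System.out.println(leadingIpRange("allow 1.2.3.4"));   // null
    }
}
```

The third case is the interesting one for a lexer: the trailing '/' is not part of the match, so it remains in the input and has to be rejected by whatever comes next.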

The Token class, which represents an individual token, looks like this.

public class Token {
    private final TokenType type;
    private final String value;

    public Token(TokenType type, String value) {
        this.type = type;
        this.value = value;
    }

    public TokenType getType() {
        return type;
    }

    public String getValue() {
        return value;
    }

    // equals() method...
}
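
The equals() method is elided above, but the tests later compare tokens with equalTo, so value-based equality is needed. The following is only a sketch of what it might look like, not the original code: TokenEqualitySketch is a hypothetical wrapper, and its nested TokenType is a reduced stand-in that keeps the example self-contained.

```java
import java.util.Objects;

public class TokenEqualitySketch {

    // Reduced stand-in for the post's TokenType, only to keep this sketch self-contained.
    enum TokenType { TT_ALLOW, TT_IPRANGE, TT_EOF }

    static final class Token {
        private final TokenType type;
        private final String value;

        Token(TokenType type, String value) {
            this.type = type;
            this.value = value;
        }

        // Two tokens are equal when both type and value match.
        // Objects.equals is null-safe, which matters because the EOF token has a null value.
        @Override
        public boolean equals(Object other) {
            if (this == other) return true;
            if (!(other instanceof Token)) return false;
            Token that = (Token) other;
            return type == that.type && Objects.equals(value, that.value);
        }

        // Kept consistent with equals(), as the Object contract requires.
        @Override
        public int hashCode() {
            return Objects.hash(type, value);
        }
    }

    public static void main(String[] args) {
        Token a = new Token(TokenType.TT_ALLOW, "allow");
        Token b = new Token(TokenType.TT_ALLOW, "allow");
        System.out.println(a.equals(b)); // true
        System.out.println(new Token(TokenType.TT_EOF, null)
                .equals(new Token(TokenType.TT_EOF, null))); // true
        System.out.println(a.equals(new Token(TokenType.TT_IPRANGE, "allow"))); // false
    }
}
```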

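Now let's use TokenType to build a simple lexer that splits the input string into tokens and provides a list of Token objects. To keep the implementation simple, the lexer takes a string and returns the list of tokens, wrapped in the following token buffer.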

import java.util.List;

public class TokenBuffer {
    private List<Token> tokenList;
    private int currentPosition = 0;

    public TokenBuffer(List<Token> tokenList) {
        this.tokenList = tokenList;
    }

    public Token currentToken() {
        return tokenList.get(currentPosition);
    }

    public Token currentTokenAndMoveNext() {
        return tokenList.get(currentPosition++);
    }

    public boolean hasNext() {
        return currentPosition < tokenList.size() - 1;
    }

    public boolean hasCurrent() {
        return currentPosition < tokenList.size();
    }

    public void moveNext() {
        currentPosition++;
    }

    public int currentPosition() {
        return currentPosition;
    }

    ...
}

Now let's look at the Lexer code.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Lexer {
    private String code;
    private List<Token> tokenList = new ArrayList<>();
    private List<TokenTypePattern> typePatterns = new ArrayList<>();

    public Lexer(String code) {
        this.code = code;
        // TT_EOF has no regex, so no pattern is created for it.
        for (TokenType type : TokenType.values()) {
            if (type.hasRegex()) {
                typePatterns.add(new TokenTypePattern(type));
            }
        }
    }

    public TokenBuffer tokenize() {
        // Keep extracting tokens until nothing matches or the input is used up.
        while (matchToken() && !eof()) {
        }
        if (!eof()) {
            // Some input is left that matches no token type.
            // MatchingTokenNotFoundException is not shown in this post; since tokenize()
            // has no throws clause, it has to be an unchecked exception.
            throw new MatchingTokenNotFoundException();
        }
        tokenList.add(new Token(TokenType.TT_EOF, null));
        return new TokenBuffer(tokenList);
    }

    private boolean matchToken() {
        boolean match = false;
        Iterator<TokenTypePattern> patternIter = typePatterns.iterator();
        while (!match && patternIter.hasNext()) {
            TokenTypePattern ttPattern = patternIter.next();
            Matcher matcher = ttPattern.pattern.matcher(code);
            if (matcher.find()) {
                if (ttPattern.type.isOutputIncluded()) {
                    tokenList.add(new Token(ttPattern.type, matcher.group()));
                }
                match = true;
                // Drop the matched prefix; the rest is tokenized on the next call.
                code = code.substring(matcher.end());
            }
        }
        return match;
    }

    private boolean eof() {
        return code.length() == 0;
    }

    // Pairs a TokenType with its compiled Pattern.
    private static class TokenTypePattern {
        private TokenType type;
        private Pattern pattern;

        public TokenTypePattern(TokenType type) {
            this.type = type;
            this.pattern = Pattern.compile(type.getRegex());
        }
    }
}

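The constructor receives the string to tokenize and stores it in the code field. It also builds the list of TokenTypePattern objects from TokenType. TokenTypePattern is defined as an inner class and holds a TokenType together with the Pattern used to recognize that token.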

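The main logic is in matchToken(). It uses the Pattern of each TokenTypePattern to look for a TokenType that matches the beginning of the code string. If one matches, it creates a Token from the matched text and adds it to tokenList, unless the type is excluded from the output, as whitespace is. It then assigns the rest of the string, with the matched part removed, back to code.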

matchToken() returns true if a token matched and false otherwise. tokenize() uses this return value to decide whether to keep extracting tokens: its while loop repeats as long as matchToken() returns true and eof() is false. Since eof() returns true when no input is left, the loop ends either when no TokenType matches or when the whole input has been tokenized.

After the loop, tokenize() checks eof() again. If eof() is not true, some input matches no token type, so it throws an exception. Otherwise the whole input was tokenized, so it appends an EOF token and returns a TokenBuffer.
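
The same loop can be sketched in a more compact, self-contained form. This is a variation, not the code above: it keeps an index into the input instead of calling substring(), and it uses Matcher.region() with lookingAt(), which only matches at the start of the region, so the patterns need no "^". The class name LookingAtLexerSketch, the string-based token names, and the use of IllegalArgumentException are all assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookingAtLexerSketch {

    // Token names mapped to unanchored patterns, tried in insertion order.
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("ALLOW", Pattern.compile("allow"));
        PATTERNS.put("DENY", Pattern.compile("deny"));
        PATTERNS.put("ORDER", Pattern.compile("order"));
        PATTERNS.put("IPRANGE",
                Pattern.compile("[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+((/|-)[0-9]+)?"));
        PATTERNS.put("COMMA", Pattern.compile(","));
        PATTERNS.put("WS", Pattern.compile("[ \\t\\r\\n]+"));
    }

    // Splits the input into "NAME:text" strings, skipping whitespace and ending with "EOF".
    static List<String> tokenize(String code) {
        List<String> tokens = new ArrayList<>();
        int position = 0;
        while (position < code.length()) {
            boolean matched = false;
            for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
                Matcher matcher = entry.getValue().matcher(code);
                matcher.region(position, code.length());
                // lookingAt() succeeds only if the match begins exactly at 'position'.
                if (matcher.lookingAt()) {
                    if (!entry.getKey().equals("WS")) {
                        tokens.add(entry.getKey() + ":" + matcher.group());
                    }
                    position = matcher.end();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                throw new IllegalArgumentException("No token matches at index " + position);
            }
        }
        tokens.add("EOF");
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("order deny, allow"));
        // [ORDER:order, DENY:deny, COMMA:,, ALLOW:allow, EOF]
    }
}
```

Tracking an index avoids creating a new string for every token, which hardly matters for a few configuration lines but keeps error positions easy to report.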

Testing the simple Lexer

I wrote test code to check the behavior.

import static org.hamcrest.CoreMatchers.equalTo;
import static org.junit.Assert.assertThat;
import static org.junit.Assert.fail;

import org.junit.Test;

public class LexerTest {
    @Test
    public void noTokens() throws Exception {
        Lexer lexer = new Lexer("");
        TokenBuffer tokenBuffer = lexer.tokenize();
        assertThat(tokenBuffer.currentToken(), equalTo(eofToken()));
        assertThat(tokenBuffer.hasNext(), equalTo(false));
    }

    @Test
    public void twoTokens() throws Exception {
        Lexer lexer = new Lexer("allow 1.2.3.4");
        TokenBuffer tokenBuffer = lexer.tokenize();
        assertThat(tokenBuffer.currentTokenAndMoveNext(),
                equalTo(token(TokenType.TT_ALLOW, "allow")));
        assertThat(tokenBuffer.currentTokenAndMoveNext(),
                equalTo(token(TokenType.TT_IPRANGE, "1.2.3.4")));
        assertThat(tokenBuffer.currentTokenAndMoveNext(), equalTo(eofToken()));
        assertThat(tokenBuffer.hasNext(), equalTo(false));
        assertThat(tokenBuffer.hasCurrent(), equalTo(false));
    }

    @Test
    public void invalidToken() throws Exception {
        Lexer lexer = new Lexer("allow 1.2.3.4 noToken");
        try {
            lexer.tokenize();
            fail();
        } catch (MatchingTokenNotFoundException ex) {
            // expected: "noToken" matches no token type
        }
    }

    @Test
    public void invalidToken2() throws Exception {
        Lexer lexer = new Lexer("allow 1.2.3.4/");
        try {
            lexer.tokenize();
            fail();
        } catch (MatchingTokenNotFoundException ex) {
            // expected: the trailing "/" matches no token type
        }
    }

    private static Token token(TokenType type, String value) {
        return new Token(type, value);
    }

    private static Token eofToken() {
        return new Token(TokenType.TT_EOF, null);
    }
}

It seems to work for now. In the next post, let's build a simple recursive descent parser.

License: Attribution, NonCommercial, NoDerivatives

자바캔(Java Can Do IT)
