The tokenizer package
The tokenizer package tokenizes Java and C/C++ source code at either the line level or the file level.
What are the different tokenizers provided?
Generic
- UTFTokenizer: based on the tokenizer created for Terrier, see: UTF Tokeniser
It treats every non-alphanumeric character as a delimiter. Consecutive delimiters are kept together as a single token, and only space delimiters are removed.
Available at both the line level and the file level.
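The delimiter rule described above can be illustrated with a minimal sketch. This is not the package's UTFTokenizer: the class name `DelimiterRunTokenizerSketch` and the method `tokenizeLine` are hypothetical, and the handling of edge cases (for example, a space between two delimiters splitting them into separate tokens) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the delimiter rule; not the package's UTFTokenizer. */
class DelimiterRunTokenizerSketch {

    /** Splits one line into alphanumeric runs and runs of non-space delimiters. */
    static List<String> tokenizeLine(String line) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean currentIsWord = false;
        for (int i = 0; i < line.length(); ) {
            int cp = line.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                // Whitespace ends the current token and is itself dropped.
                flush(current, tokens);
                continue;
            }
            boolean isWord = Character.isLetterOrDigit(cp);
            // Switching between word characters and delimiters starts a new token.
            if (current.length() > 0 && isWord != currentIsWord) {
                flush(current, tokens);
            }
            current.appendCodePoint(cp);
            currentIsWord = isWord;
        }
        flush(current, tokens);
        return tokens;
    }

    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        // prints [int, count, +=, arr, [, i, ];]
        System.out.println(tokenizeLine("int count += arr[i];"));
    }
}
```

Note how `+=` and `];` each stay together as one token, while the spaces disappear.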
Java
Code Based
- UTFWocTokenizer: same as UTFTokenizer, except that comments are removed. Available only at the file level.
- JavaLemmeTokenizer: tokenizer based on the lemmatization of Java code performed by Java Parser. Available at both the file level and the line level.
- JavaLemmeWocTokenizer: same as JavaLemmeTokenizer, except that comments are removed. Available only at the file level.
AST Based (only available at the file level)
- DepthFirst: tokenizes according to the AST generated by JavaParser, traversing the tree depth first. Each token corresponds to the text serialization of a node.
- BreadthFirst: tokenizes according to the AST generated by JavaParser, traversing the tree breadth first. Each token corresponds to the text serialization of a node.
- DepthFirstPruned: same as DepthFirst, except that the intermediate nodes of the tree are removed.
- BreadthFirstPruned: same as BreadthFirst, except that the intermediate nodes of the tree are removed.
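To make the difference between the four traversal orders concrete, here is a small sketch on a hand-built tree. `SketchNode` and `AstTraversalSketch` are hypothetical stand-ins, not JavaParser classes, and the tree shape is simplified for illustration. The sketch also assumes that "pruned" means only leaf nodes are emitted, which is one reading of the description above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical stand-in for an AST node; not JavaParser's Node class. */
class SketchNode {
    final String text;               // text serialization of the node
    final List<SketchNode> children;

    SketchNode(String text, SketchNode... children) {
        this.text = text;
        this.children = Arrays.asList(children);
    }
}

class AstTraversalSketch {

    /** Pre-order depth-first walk: a node is emitted before its children. */
    static List<String> depthFirst(SketchNode root, boolean leavesOnly) {
        List<String> tokens = new ArrayList<>();
        Deque<SketchNode> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            SketchNode node = stack.pop();
            if (!leavesOnly || node.children.isEmpty()) {
                tokens.add(node.text);
            }
            // Push children in reverse so the leftmost child is visited first.
            for (int i = node.children.size() - 1; i >= 0; i--) {
                stack.push(node.children.get(i));
            }
        }
        return tokens;
    }

    /** Breadth-first walk: nodes are emitted level by level. */
    static List<String> breadthFirst(SketchNode root, boolean leavesOnly) {
        List<String> tokens = new ArrayList<>();
        Deque<SketchNode> queue = new ArrayDeque<>();
        queue.addLast(root);
        while (!queue.isEmpty()) {
            SketchNode node = queue.pollFirst();
            if (!leavesOnly || node.children.isEmpty()) {
                tokens.add(node.text);
            }
            for (SketchNode child : node.children) {
                queue.addLast(child);
            }
        }
        return tokens;
    }

    /** Simplified tree for the expression f(a + b, c). */
    static SketchNode sampleTree() {
        return new SketchNode("f(a + b, c)",
                new SketchNode("a + b", new SketchNode("a"), new SketchNode("b")),
                new SketchNode("c"));
    }

    public static void main(String[] args) {
        SketchNode root = sampleTree();
        System.out.println(depthFirst(root, false));   // [f(a + b, c), a + b, a, b, c]
        System.out.println(breadthFirst(root, false)); // [f(a + b, c), a + b, c, a, b]
        System.out.println(depthFirst(root, true));    // [a, b, c]
        System.out.println(breadthFirst(root, true));  // [c, a, b]
    }
}
```

The same nodes are produced in every case; only the order, and for the pruned variants the set of retained nodes, changes.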
C/C++
Code Based
- CPPLemmeTokenizer: tokenizer based on the lemmatization of C++ code performed by the ANTLR CPP14 parser. Available at both the file level and the line level.
AST Based (only available at the file level)
- CPPASTTokenizer: tokenizes according to the AST generated by Joern, traversing the tree depth first. Each token corresponds to the text serialization of a node.
Requirements
Dependencies are handled through Maven. All of them are downloaded automatically except joern-antlr, which must be installed first and can be found at this link
Architecture
All tokenizers implement the AbstractTokenizer interface, which exposes the scope of the tokenizer and its type:
/**
 * AbstractTokenizer interface
 */
public interface AbstractTokenizer {
    /**
     * Scope of the tokenizer, lines or files
     * @return the Scope
     */
    Scope getScope();

    /**
     * Type of the Tokenizer
     * @return the type of tokenizer
     */
    String getType();
}
Depending on whether it is a line-level or a file-level tokenizer, each tokenizer then extends either
AbstractLineTokenizer
package tokenizer.line;
import tokenizer.AbstractTokenizer;
import tokenizer.Scope;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
public abstract class AbstractLineTokenizer implements AbstractTokenizer {
    /**
     * Method to tokenize a reader
     *
     * @param reader to use
     * @return an iterable of lines, each line being an iterable of tokens
     * @throws IOException in case of exception from the reader
     */
    public abstract Iterable<Iterable<String>> tokenize(Reader reader) throws IOException;

    /**
     * Method to tokenize a string
     *
     * @param s string to tokenize
     * @return an iterable of lines, each line being an iterable of tokens
     * @throws IOException in case of exception from the underlying reader
     */
    public Iterable<Iterable<String>> tokenize(String s) throws IOException {
        // try-with-resources closes the reader even if tokenize throws
        try (Reader r = new StringReader(s)) {
            return tokenize(r);
        }
    }

    @Override
    public Scope getScope() {
        return Scope.LINE;
    }
}
or AbstractFileTokenizer
package tokenizer.file;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import tokenizer.AbstractTokenizer;
import tokenizer.Scope;
import tokenizer.file.java.exception.UnparsableException;
public abstract class AbstractFileTokenizer implements AbstractTokenizer {
    /**
     * Method to tokenize a reader
     *
     * @param reader to use
     * @return the tokens, after all registered preprocessors have been applied
     * @throws IOException in case of reader exception
     * @throws UnparsableException if the content of reader could not be parsed
     */
    public abstract Iterable<String> tokenize(Reader reader) throws IOException, UnparsableException;

    /**
     * Method to tokenize a string
     *
     * @param s string to tokenize
     * @return the tokens, after all registered preprocessors have been applied
     * @throws IOException in case of exception from the underlying reader
     * @throws UnparsableException if the string could not be parsed
     */
    public Iterable<String> tokenize(String s) throws IOException, UnparsableException {
        // try-with-resources closes the reader even if tokenize throws
        try (Reader r = new StringReader(s)) {
            return tokenize(r);
        }
    }

    @Override
    public Scope getScope() {
        return Scope.FILE;
    }
}
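Both provide methods to tokenize from either a String or a Reader, but they differ in their output: the file tokenizer returns an Iterable<String>, whereas the line tokenizer returns an Iterable<Iterable<String>> with one inner Iterable per line.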
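The two output shapes can be compared with a toy example. The class `OutputShapeSketch` and its methods are hypothetical and do not extend the package's abstract classes; they only mirror the two return types, and plain whitespace splitting stands in for real tokenization.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the two return shapes; not part of the package. */
class OutputShapeSketch {

    /** Line-level shape: one inner sequence of tokens per input line. */
    static Iterable<Iterable<String>> tokenizeByLine(Reader reader) throws IOException {
        List<Iterable<String>> lines = new ArrayList<>();
        BufferedReader buffered = new BufferedReader(reader);
        String line;
        while ((line = buffered.readLine()) != null) {
            lines.add(splitOnWhitespace(line));
        }
        return lines;
    }

    /** File-level shape: a single flat sequence of tokens for the whole input. */
    static Iterable<String> tokenizeWholeFile(Reader reader) throws IOException {
        List<String> tokens = new ArrayList<>();
        BufferedReader buffered = new BufferedReader(reader);
        String line;
        while ((line = buffered.readLine()) != null) {
            tokens.addAll(splitOnWhitespace(line));
        }
        return tokens;
    }

    // Placeholder tokenization: split on runs of whitespace.
    private static List<String> splitOnWhitespace(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) throws IOException {
        String source = "int a;\nreturn a;";
        try (Reader r = new StringReader(source)) {
            System.out.println(tokenizeByLine(r));    // [[int, a;], [return, a;]]
        }
        try (Reader r = new StringReader(source)) {
            System.out.println(tokenizeWholeFile(r)); // [int, a;, return, a;]
        }
    }
}
```

A line-level result keeps line boundaries, which matters when tokens must be mapped back to specific lines; a file-level result discards them.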
How to use the tool
To obtain a tokenizer, use one of the provided factories:
File Level
- CPPFileTokenizerFactory
- JavaFileTokenizerFactory
Line Level
- CPPLineTokenizerFactory
- JavaLineTokenizer
Once you have chosen a factory, call its method for the tokenizer you need, for example:
JavaFileTokenizerFactory.lemmeTokenizer();
Third-party tools
- Java Parser (LGPL)
- Joern (LGPL)
- ANTLR4 (BSD)