The tokenizer package

The tokenizer package tokenizes Java and C/C++ source code, either at the line level or at the file level.

What are the different tokenizers provided?

Generic

Tokenizer that treats every non-alphanumeric character as a delimiter. Consecutive delimiters are kept together as a single token, and the space delimiter is the only one that is removed from the output.

Available at line and file level.
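The rule described above can be illustrated with a small stand-alone sketch. It is not the package's implementation: the class name GenericTokenizerSketch is hypothetical, and it assumes that "alphanumerical" means ASCII letters and digits and that every whitespace character counts as the removed space delimiter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the generic rule, not the package's own class.
final class GenericTokenizerSketch {

    // Either a run of ASCII letters/digits, or a run of delimiters that are not whitespace.
    private static final Pattern TOKEN =
            Pattern.compile("\\p{Alnum}+|[^\\p{Alnum}\\s]+");

    // File-level flavour: one flat list of tokens for the whole input.
    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Line-level flavour: one list of tokens per input line.
    static List<List<String>> tokenizeLines(String source) {
        List<List<String>> lines = new ArrayList<>();
        for (String line : source.split("\n", -1)) {
            lines.add(tokenize(line));
        }
        return lines;
    }

    public static void main(String[] args) {
        // ");" stays together because both characters are delimiters in a row.
        System.out.println(tokenize("foo(a, b);"));
        System.out.println(tokenizeLines("int x+=42;\nreturn x;"));
    }
}
```

Note that under this reading an underscore is a delimiter too, so an identifier such as my_var yields three tokens.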

Java

Code Based

AST Based (only available at the file level)

C/C++

Code Based

AST Based (only available at the file level)

Requirements

Dependencies are managed through Maven. All of them are downloaded automatically except joern-antlr, which needs to be installed first and can be found at this link.

Architecture

All tokenizers implement the AbstractTokenizer interface, which exposes the scope of the tokenizer and its type:

/**
 * AbstractTokenizer interface
 */
public interface AbstractTokenizer {

    /**
     * Scope of the tokenizer: lines or files
     * @return the scope
     */
    Scope getScope();

    /**
     * Type of the tokenizer
     * @return the type of tokenizer
     */
    String getType();
}

Then, depending on whether it is a line-level or a file-level tokenizer, it extends either

AbstractLineTokenizer

package tokenizer.line;

import tokenizer.AbstractTokenizer;
import tokenizer.Scope;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public abstract class AbstractLineTokenizer implements AbstractTokenizer {

    /**
     * Tokenizes the content of a reader, line by line.
     *
     * @param reader the reader to consume
     * @return one sequence of tokens per line
     * @throws IOException if the reader fails
     */
    public abstract Iterable<Iterable<String>> tokenize(Reader reader) throws IOException;

    /**
     * Tokenizes a string, line by line.
     *
     * @param s the string to tokenize
     * @return one sequence of tokens per line
     * @throws IOException if reading the string fails
     */
    public Iterable<Iterable<String>> tokenize(String s) throws IOException {
        // try-with-resources closes the reader even if tokenize(Reader) throws
        try (Reader r = new StringReader(s)) {
            return tokenize(r);
        }
    }

    @Override
    public Scope getScope() {
        return Scope.LINE;
    }

}

or AbstractFileTokenizer

package tokenizer.file;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import tokenizer.AbstractTokenizer;
import tokenizer.Scope;
import tokenizer.file.java.exception.UnparsableException;


public abstract class AbstractFileTokenizer implements AbstractTokenizer {

    /**
     * Tokenizes the content of a reader.
     *
     * @param reader the reader to consume
     * @return the tokens, after every registered preprocessor has been applied
     * @throws IOException         if the reader fails
     * @throws UnparsableException if the content of the reader could not be parsed
     */
    public abstract Iterable<String> tokenize(Reader reader) throws IOException, UnparsableException;

    /**
     * Tokenizes a string.
     *
     * @param s the string to tokenize
     * @return the tokens, after every registered preprocessor has been applied
     * @throws IOException         if reading the string fails
     * @throws UnparsableException if the string could not be parsed
     */
    public Iterable<String> tokenize(String s) throws IOException, UnparsableException {
        // try-with-resources closes the reader even if tokenize(Reader) throws
        try (Reader r = new StringReader(s)) {
            return tokenize(r);
        }
    }

    @Override
    public Scope getScope() {
        return Scope.FILE;
    }
}

Both provide methods to tokenize either a String or a Reader, but they differ in their output: a file tokenizer returns an Iterable<String>, whereas a line tokenizer returns an Iterable<Iterable<String>>, with one inner Iterable per line.
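To make the difference between the two output shapes concrete, here is a minimal stand-alone sketch. It does not use the package's classes: OutputShapeSketch and its two helper methods are hypothetical, and they simply split on whitespace rather than parse Java or C/C++.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helpers that only mimic the two return shapes described above.
final class OutputShapeSketch {

    // Line-level shape: one inner sequence of tokens per input line.
    static Iterable<Iterable<String>> tokenizeLines(String source) {
        List<Iterable<String>> lines = new ArrayList<>();
        for (String line : source.split("\n", -1)) {
            List<String> tokens = new ArrayList<>();
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    tokens.add(token);
                }
            }
            lines.add(tokens);
        }
        return lines;
    }

    // File-level shape: a single flat sequence of tokens for the whole input.
    static Iterable<String> tokenizeFile(String source) {
        List<String> tokens = new ArrayList<>();
        for (Iterable<String> line : tokenizeLines(source)) {
            for (String token : line) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String source = "int a ;\nreturn a ;";
        System.out.println(tokenizeLines(source)); // nested: line boundaries preserved
        System.out.println(tokenizeFile(source));  // flat: line boundaries lost
    }
}
```

A consumer that needs to map tokens back to line numbers would pick the line-level shape; one that treats the file as a single token stream would pick the file-level shape.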

How to use the tool

To obtain a tokenizer, use one of the provided factories:

File Level

CPPFileTokenizerFactory

JavaFileTokenizerFactory

Line Level

CPPLineTokenizerFactory

JavaLineTokenizer

Once you have chosen a factory, call its lemmeTokenizer() method, for example:

JavaFileTokenizerFactory.lemmeTokenizer();

Third Party tool