What is RESS
The basic use of the Rusty ECMAScript Scanner
2018-10-02

After releasing the Rusty ECMAScript Scanner (RESS) 0.5, my next big effort in the Rust+JavaScript space is to increase the amount of documentation. This post is an effort to clarify what RESS does and how someone might use it.
The first thing to cover is the question: what is a scanner's role in parsing code? The basic idea behind a scanner is to stand between the bare text of a code file and the component that will interpret the larger context (most likely a parser). The point is to separate the process of reading text from the process of determining what that text actually means. A scanner typically provides a series of tokens to whoever is using it. A token can be as simple as a single character or as complicated as a whole word; it all depends on the scanner. For RESS I wanted to provide a little more information than a single character, and I ended up with 11 distinct token types.
RESS Tokens
- Boolean - `true` or `false`
- End of File - The end of input
- Identifier - A variable, function, property or method name e.g. `Math`, `x`
- Keyword - A reserved word, something that carries additional meaning e.g. `var`, `function`, `while`
- Null - `null`
- Number - A number e.g. `-0.11`, `0xccc`, `0o777`, `0b101010`, `4e1123`
- Punctuation - A symbol that carries additional meaning e.g. `{`, `)`, `^`, `+`, `===`
- String - A quoted block of text e.g. `'use strict'`, `"stuff in quotes"`
- RegEx - A regular expression literal e.g. `/[a-zA-Z]+/g`
- Template - A string with values inside e.g. `` `please give me ${4} more pieces.` ``
- Comment - Text wrapped in `/* */` or preceded by `//` e.g. `//single line comment`, `/* single line multi-line comment */`
These are the different types of information that a word or symbol in a JavaScript file might represent. Let's use an example to illustrate what I mean.
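Before we get to that example, here is a rough sketch of how those eleven categories might be modeled as a Rust enum. This is purely illustrative; it is not ress's actual `Token` definition, and the real variants and their payloads differ in their details.

```rust
// Illustrative only: a rough approximation of the eleven token categories.
// ress's real `Token` type and its nested enums are shaped differently.
enum Token {
    Boolean(bool),
    EoF,
    Ident(String),
    Keyword(Keyword),
    Null,
    Numeric(String),
    Punct(Punct),
    String(String),
    RegEx(String),
    Template(String),
    Comment(String),
}

// Trimmed-down stand-ins for the nested types, with just a few variants.
enum Keyword {
    Function,
    Var,
    While,
}

enum Punct {
    OpenParen,
    CloseParen,
    OpenBrace,
    CloseBrace,
    Period,
    SemiColon,
}
```

With that rough picture in mind, here is the example.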
```js
function print(message) {
    console.log(message)
}
```
The above JavaScript as tokens would look like this:
- `function` - Keyword
- `print` - Identifier
- `(` - Punctuation
- `message` - Identifier
- `)` - Punctuation
- `{` - Punctuation
- `console` - Identifier
- `.` - Punctuation
- `log` - Identifier
- `(` - Punctuation
- `message` - Identifier
- `)` - Punctuation
- `}` - Punctuation
It might seem foreign to think about code like this, since we typically think about larger parts of our code, like function definitions or variable assignments. Eventually any parser needs to get to that level, but we are working one step below it. Normally the scanner sits idly by until the parent process asks for another token, which makes it a great fit for implementing the Rust `Iterator` trait. This is exactly how the `Scanner` works: you can create a new scanner with some text and then loop over its results.
```rust
extern crate ress;

use ress::Scanner;

fn main() {
    let js = "function print(message) {
console.log(message)
}";
    let scanner = Scanner::new(js);
    for item in scanner {
        println!("{:?} ", item.token);
    }
}
```
The above program would output the following.
```
Keyword(Keyword::Function)
Ident(Ident("print"))
Punct(Punct::OpenParen)
Ident(Ident("message"))
Punct(Punct::CloseParen)
Punct(Punct::OpenBrace)
Ident(Ident("console"))
Punct(Punct::Period)
Ident(Ident("log"))
Punct(Punct::OpenParen)
Ident(Ident("message"))
Punct(Punct::CloseParen)
Punct(Punct::CloseBrace)
```
One of the nice things about the Rust `Iterator` trait is the amount of functionality you get for free once you have implemented it. You can use `filter` or `map` with it, and you can even `collect` it up into a `Vec`, right out of the box.
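As a quick illustration, here is a small sketch that reuses the snippet above. It only relies on the pieces of the `Scanner` API already shown in this post (`Scanner::new`, iterating over items, and the `token` and `span.start` fields), so treat it as an example rather than a tour of the full API.

```rust
extern crate ress;

use ress::Scanner;

fn main() {
    let js = "function print(message) {
console.log(message)
}";
    // `collect` the whole token stream into a `Vec`, straight off the iterator
    let all_tokens: Vec<_> = Scanner::new(js).collect();
    println!("total tokens: {}", all_tokens.len());
    // or `map` each item to the index where its token starts
    let starts: Vec<usize> = Scanner::new(js).map(|item| item.span.start).collect();
    println!("token start indexes: {:?}", starts);
}
```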
Now that we have a basic idea of how RESS works, it might not seem super helpful all by itself. To build something with RESS, you would need to embed a `Scanner` into something that knows about the larger context. It would be that parent's job to tie all of the tokens together into something more useful. Let's build something to see what I mean.
For this example we are going to build a tool that helps a team of JS developers stick to a very simple style guideline: "no semi-colons should be used". This should be pretty straightforward, since the only context we care about is that no semi-colons are allowed; either all of the tokens pass or the check fails.
To get started, let's create this project using `cargo`.
```sh
cargo new semi_finder
cd semi_finder
```
Then we will add `ress` and, since I want to be able to evaluate a whole directory of files, `walkdir` as dependencies. If you are not familiar with `walkdir`, it is a library for easily dealing with file system directories and their contents.
Cargo.toml
```toml
[package]
name = "semi_finder"
version = "0.1.0"
authors = ["Robert Masen <r@robertmasen.pizza>"]

[dependencies]
ress = "0.4"
walkdir = "2.2"
```
src/main.rs
```rust
extern crate ress;
extern crate walkdir;
```
Once we have them added, we can import the things we want to use from them. From `ress` we are going to need the `Scanner` and `Punct`, since the semi-colon is a variant of `Punct`. From `walkdir` we are going to use the `WalkDir` struct. Finally, from the standard library we need `collections::HashMap` to keep track of the files that have semi-colons, `env::args` to get the arguments from the command line, `fs::read_to_string` to easily read a file into a string, and `path::PathBuf` to pass our paths around. All of that would look like this.
```rust
use ress::{Punct, Scanner};
use walkdir::WalkDir;
use std::{
    collections::HashMap,
    env::args,
    fs::read_to_string,
    path::PathBuf
};
```
Next, let's write a function that will take in a JS string, pass it to a `Scanner`, and then check for any semi-colons. If we find any, we are going to add each one's index to a `Vec` that will get returned when we are done.
```rust
fn check_js(js: &str) -> Vec<usize> {
    // Create a scanner with the text then
    // filter out any tokens that are not semi-colons
    // then collect them all into a `Vec` of the start indexes
    Scanner::new(js).filter_map(|item| {
        // If this token matches the `Punct::SemiColon`
        if item.token.matches_punct(Punct::SemiColon) {
            // we want to return the first position of this token
            // since semi-colons are only 1 character wide we would
            // only need this part of the `Span`
            Some(item.span.start)
        } else {
            None
        }
    }).collect()
}
```
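As a quick sanity check, a tiny unit test might look like the following. It assumes, as the example output near the end of this post suggests, that `span.start` is a zero-based index into the source string.

```rust
#[test]
fn finds_the_semicolon() {
    // the lone semi-colon in this snippet sits at index 9,
    // assuming `span.start` is a zero-based index into the source
    assert_eq!(check_js("let x = 1;"), vec![9]);
}
```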
Now that we can determine if any arbitrary JavaScript contains semi-colons, let's build a function that will loop over a directory of files and check each one. This time we are going to collect all of our results into a `HashMap` with the path as the key and the list of semi-colon indexes as the value.
```rust
fn check_files(start: String) -> HashMap<PathBuf, Vec<usize>> {
    // We are going to store the location of any semi-colons we have found
    let mut ret: HashMap<PathBuf, Vec<usize>> = HashMap::new();
    // loop over the entries in our path;
    // set the min_depth to 1, so we will skip the
    // path passed in as `start`
    for entry in WalkDir::new(start).min_depth(1) {
        match entry {
            Ok(entry) => {
                // If the entry doesn't error,
                // capture the path of this entry
                let path = entry.path();
                // if the path ends with js, we want to check for semi-colons
                if path.extension() == Some(::std::ffi::OsStr::new("js")) {
                    // if we can read the file to a string,
                    // pass the text off to our check_js fn;
                    // if we can't, we'll just skip it for now
                    if let Ok(js) = read_to_string(path) {
                        let indexes = check_js(&js);
                        // if we found any semi-colons, add them to our hashmap
                        if !indexes.is_empty() {
                            ret.insert(path.to_path_buf(), indexes);
                        }
                    }
                }
            },
            Err(e) => eprintln!("failed to get a directory entry: {:?}", e),
        }
    }
    ret
}
```
At this point, all we have left to do is provide a small CLI for it. Let's update `main` to include the following. It is a bit of a naive approach to parsing arguments, but we don't really need much.
```rust
fn main() {
    // get the command line arguments that started this process
    let mut args = args();
    // discard the first argument, this will be the path to our
    // executable
    let _ = args.next();
    // The next argument will be the path to check;
    // display an error to the user if no path
    // was provided and exit early
    let start = if let Some(start) = args.next() {
        start
    } else {
        eprintln!("No directory provided as starting location.");
        return;
    };
    // Pass the argument off to our `check_files` function
    let issues = check_files(start);
    // If no issues were found
    if issues.is_empty() {
        // Print the success message
        println!("Good to go, no semicolons found");
    } else {
        // Otherwise loop over the hashmap and
        // tell the user where we found semi-colons that need to be
        // removed
        for (path, indexes) in issues {
            println!("{} issues found in {:?} at indexes:", indexes.len(), path);
            println!("\t{:?}\n", indexes);
        }
    }
}
```
To test all this out, let's create a directory in our project root called `js` and add a file to it called `file.js` with the following.
```js
function hideContent(el) {
    console.log('hideContent', el);
    let content = el.parentElement.children[1]
    console.log(content)
    let currentClass = content.getAttribute('class')
    let newClass;
    if (currentClass.indexOf('hidden') > -1) {
        newClass = currentClass.replace(' hidden')
    } else {
        newClass = currentClass + ' hidden'
    }
    content.setAttribute('class', newClass);
}
```
Then run the following to evaluate the file.
```sh
cargo run -- ./js/
```

```
3 issues found in "./js/file.js" at indexes:
[62, 209, 421]
```
Looks like it is working just fine! It might be nice to have a method to go from an index to a line and column number but I think that might be for version 2.
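If you wanted to experiment with that yourself in the meantime, a hypothetical helper (not part of RESS or the example above) might look something like this, assuming the indexes are zero-based byte offsets into the original source:

```rust
// A hypothetical helper, not part of RESS or the example above: given the
// original source text and a byte index, walk the text and report a
// 1-based line and column for that index.
fn line_and_column(source: &str, index: usize) -> (usize, usize) {
    let mut line = 1;
    let mut column = 1;
    for (i, ch) in source.char_indices() {
        if i >= index {
            break;
        }
        if ch == '\n' {
            line += 1;
            column = 1;
        } else {
            column += 1;
        }
    }
    (line, column)
}
```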
In the process of putting this post together, I have added the above code to the examples for RESS. You can find the full file here. If you have RESS cloned on your computer and want to try this example out, you can do so with the following command.

```sh
cargo run --example semi_finder -- ./path/to/js
```
Next post, I plan on covering `ressa`, which uses `ress` to move up to the next level of a parser's job.