by Guest » Fri Apr 08, 2011 6:13 am
Gah, I give up even logging in. I'll just sign my posts so you know who it is. :p
Well, I didn't go down the JFlex route, so I don't know how hard that really is. But from reading your posts, and given that adding keywords is hard there, I would guess Ragel is easier. You basically write regular expressions and attach code (aka "actions") to any state change. As with most regex, it is daunting at first glance, but Ragel's documentation PDF is very good and is all you need. Quick example:
[code]
'\'' @stringChar
^'\''* >buffer %string
'\'' @stringChar[/code]
This defines 3 "machines". I put them on different lines so you can more easily see each machine. Here is what they do:
1) This matches a literal single quote. The @ means "finishing", so when the single quote is matched, the code snippet (aka action) called "stringChar" is run.
2) The '\'' matches a single quote, ^ negates it to match any character except a single quote, and * means zero or more. So this matches zero or more characters that are not a single quote. The > means "entering", so the "buffer" action is called when the machine is entered. The buffer action is code I wrote which stores the current offset into the char[] being parsed. The % means "leaving" (slightly different from @, but don't worry about that for now), so the "string" action is called when the machine is exited. This action uses the stored offset and the current offset to work out which characters were matched, so it can add a token.
3) This is the same machine as 1.
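If it helps to see the offset bookkeeping spelled out, here is a rough hand-written Java equivalent of what the three machines and the "buffer"/"string" actions do together. This is only a sketch and not what Ragel generates: the class name, method name, and variables are made up for illustration, and "stringChar" is left out because I haven't said what it does.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written stand-in for the three machines; NOT Ragel output.
// All names here are hypothetical.
class QuotedStringSketch {
    // Returns the contents of every complete '...' string found in data.
    static List<String> strings(char[] data) {
        List<String> tokens = new ArrayList<>();
        int p = 0;
        while (p < data.length) {
            if (data[p] != '\'') {
                p++;                 // outside a string, skip ahead
                continue;
            }
            p++;                     // machine 1: the opening quote
            int s = p;               // "buffer": remember the start offset
            while (p < data.length && data[p] != '\'') {
                p++;                 // machine 2: zero or more non-quote chars
            }
            if (p == data.length) {
                break;               // unterminated string: add no token
            }
            // "string": stored offset + current offset give the matched text
            tokens.add(new String(data, s, p - s));
            p++;                     // machine 3: the closing quote
        }
        return tokens;
    }
}
```

For example, QuotedStringSketch.strings("a 'bc' ''".toCharArray()) gives a list with "bc" and an empty string, since machine 2 is allowed to match zero characters.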
See the TableLayoutTokenizer I linked above. I only parse for "keywords" in general, without the grammar knowing which keyword it is, then I use a HashSet to check whether the matched text really is a keyword. This makes it super simple for someone to come along and add keywords.
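The HashSet part is roughly this. It is a minimal sketch, not copied from TableLayoutTokenizer: the class name, the keyword list, and the "KEYWORD"/"IDENTIFIER" labels are placeholders I picked for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of the lookup step only; all names and keywords are hypothetical.
class KeywordLookupSketch {
    // Adding a keyword means adding one string here; the grammar is untouched.
    static final Set<String> KEYWORDS =
            new HashSet<>(Arrays.asList("if", "else", "while"));

    // Called with the text the tokenizer matched as a generic word.
    static String classify(String word) {
        return KEYWORDS.contains(word) ? "KEYWORD" : "IDENTIFIER";
    }
}
```

The point of doing it this way is that the state machine stays small and never has to be regenerated when the keyword list changes.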