by robert » Thu Feb 04, 2010 1:47 pm
I was thinking along the lines of a new method to go alongside getLastTokenTypeOnLine(), something like "getExtraDataForLine(int)". This would return arbitrary data that is meaningful to the current TokenMaker, but could vary from one TokenMaker to the next. The current TokenMakers, for example, wouldn't need it. This information could be used to specify things such as nested comment depth, the current "section" if a language's source code is divided into discrete sections, etc.
The implementation wouldn't use a new int per line, but rather would use space in the current "lastTokenTypeOnLine" list of ints. There would be a new limit on the number of states (say 256, i.e. 8 bits, which should be more than enough), and the remaining 24 bits would be used for the "extra data." So there'd be no extra space or time overhead for languages that don't use the feature.
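To make the bit layout concrete, here is a rough sketch of how the packing could look. PackedLineState and its methods are hypothetical names for illustration only, not anything in RSTA, and it assumes states are renumbered into 0..255 (today's negative internal states would need remapping first).

```java
/**
 * Hypothetical helper (not part of RSTA): packs a TokenMaker state and
 * per-line "extra data" into the single int already stored for each line.
 * Assumes states have been renumbered into the range 0..255.
 */
final class PackedLineState {

    private static final int STATE_BITS = 8;                            // 256 states
    private static final int STATE_MASK = (1 << STATE_BITS) - 1;        // 0xFF
    private static final int MAX_EXTRA = (1 << (32 - STATE_BITS)) - 1;  // 24 bits

    private PackedLineState() {
    }

    /** Low 8 bits hold the state, high 24 bits hold the extra data. */
    static int pack(int state, int extraData) {
        if (state < 0 || state > STATE_MASK) {
            throw new IllegalArgumentException("state out of range: " + state);
        }
        if (extraData < 0 || extraData > MAX_EXTRA) {
            throw new IllegalArgumentException("extra data out of range: " + extraData);
        }
        return (extraData << STATE_BITS) | state;
    }

    /** Extracts the state from a packed value. */
    static int state(int packed) {
        return packed & STATE_MASK;
    }

    /** Extracts the extra data; unsigned shift so the top bit is not sign-extended. */
    static int extraData(int packed) {
        return packed >>> STATE_BITS;
    }
}
```

Note that pack(state, 0) is just the state itself, so a TokenMaker that never sets extra data would store exactly the same values it does now.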
I'm starting to question the need for this, though. Implementing this would require slightly modifying and recompiling (and re-testing) several current TokenMakers, and the thing is, I'm still not convinced this cannot be done with the current implementation. For example, instead of my previous proposal of a certain number of states representing "comment depth," say you had a single state be "comment, 1 deep" and each succeeding state (i.e. token type) be an extra comment layer. Since negative token types are currently used for states internal to a particular TokenMaker, you could do something like this:
Code:
/**
 * Token type this TokenMaker uses for "last token type on line" for multi-line comments 1 level deep.
 * Anything less than this is used to specify more layers; i.e. "-11" means "2 levels deep," "-12"
 * means "3 levels deep," etc. This allows arbitrary nested comment depth. Any other internal states
 * would have to have values in the range -1..-9 in this case.
 */
public static final int INTERNAL_MLC_DEPTH_1 = -10;
Then, just end un-ended MLC lines with (INTERNAL_MLC_DEPTH_1-depth+1) instead of COMMENT_MULTILINE. Your parsing code would decode the lastTokenType for the previous line, set a "depth" field, then parse the current line with this knowledge.
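Roughly, the encode/decode step could look like the sketch below. NestedCommentState, encodeDepth, decodeDepth and depthAfterLine are made-up names for illustration, and the "/*" and "*/" delimiters are only an assumption; a real TokenMaker would fold this into its existing scanner and also deal with strings, single-line comments and so on.

```java
/**
 * Hypothetical sketch (not RSTA code): stores nested comment depth in the
 * negative "last token type on line" values at or below INTERNAL_MLC_DEPTH_1.
 */
final class NestedCommentState {

    /** Depth 1 is -10, depth 2 is -11, depth 3 is -12, and so on. */
    static final int INTERNAL_MLC_DEPTH_1 = -10;

    private NestedCommentState() {
    }

    /** Token type to end a line with when it stops inside a comment 'depth' levels deep. */
    static int encodeDepth(int depth) {
        if (depth < 1) {
            throw new IllegalArgumentException("depth must be >= 1: " + depth);
        }
        return INTERNAL_MLC_DEPTH_1 - depth + 1;
    }

    /** Depth carried over from the previous line, or 0 if it did not end in a nested comment. */
    static int decodeDepth(int lastTokenType) {
        return lastTokenType <= INTERNAL_MLC_DEPTH_1
                ? INTERNAL_MLC_DEPTH_1 - lastTokenType + 1
                : 0;
    }

    /** Naive scan of one line, assuming "/*" opens and "*/" closes a comment level. */
    static int depthAfterLine(CharSequence line, int startDepth) {
        int depth = startDepth;
        int i = 0;
        while (i < line.length() - 1) {
            char c = line.charAt(i);
            char next = line.charAt(i + 1);
            if (c == '/' && next == '*') {
                depth++;      // one level deeper
                i += 2;
            } else if (c == '*' && next == '/' && depth > 0) {
                depth--;      // one level closed
                i += 2;
            } else {
                i++;
            }
        }
        return depth;
    }
}
```

For each line you'd call decodeDepth() on the previous line's last token type, scan the line starting from that depth, and if the result is still greater than zero, end the line with encodeDepth(depth) instead of COMMENT_MULTILINE.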
You mentioned earlier that you felt like this functionality should be "built-in" to RSTA, and I agree to a certain extent. But if a language supports nested comments, its TokenMaker will have to contain some code to support them whether RSTA has any built-in support or not, so I'm not sure how having a "comment depth" field, or an arbitrary info-per-line field, really helps over my proposal above.
Or am I missing something?

Suggestions are welcome of course.