How to handle 'free form' text alongside grammar defined statements

Discussion:

j***@jptechnical.co.uk

2013-06-16 18:45:10 UTC

I've just completed a *PLY* project to do some preprocessing on a 'C'
source file for a PIC microcontroller (specifically, to handle my version of
the #rom directive, converting it to the compiler's version). This is working
just as I want, but I have only defined the grammar for the specific #rom
statements I wish to process.

It would be convenient, in the same file, to have various other 'C'
statements (free form text), #define or typedefs for example. Can someone
suggest a way to do this without having to put the complete 'C' grammar, or
subsections of it, into my preprocessor? That is to say, I want the
preprocessor to copy all non-#rom statements (i.e. lines of other 'C' code)
'as is' to the output file. It is only the #rom statements I need to
process on the way to the output file.

Needless to say, the 'free form text' and the #rom statements overlap in
terms of some tokens. The free form text may include 'C' expressions, as may
a #rom statement, and also literals such as { } ( ) @ etc.

Any ideas would be appreciated.

Regards,

John

--
You received this message because you are subscribed to the Google Groups "ply-hack" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ply-hack+***@googlegroups.com.
To post to this group, send email to ply-***@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ply-hack/d8801853-3b90-41b2-a929-6ea2f97ece19%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Bruce Frederiksen

2013-06-16 19:09:09 UTC

One approach would be to have two different scanner states. To keep this
simple two conditions would have to be met:

1. A preprocessing statement can be recognized by its first token.
I.e., the first token in any preprocessing statement doesn't appear in any
C code.
2. The end of each preprocessing statement can be determined without
having to look ahead at the next token (which would be the first C token to
copy after the preprocessing statement).

If these conditions can be met for your preprocessing statements, then the
two scanner states would be:

- A "copy" state that only recognizes and returns the first tokens for
each of the preprocessing statements and copies everything else.
- A "full scan" state that recognizes and returns all tokens.

The parser could then switch the scanner back and forth between these two
states. It would be written to recognize a series of preprocessing
statements and nothing else. The scanner would start in the "copy" state,
and all of the C code would not be seen by the parser. When the scanner
sees one of the first tokens, it returns it to the parser. Upon seeing the
first token for any preprocessing statement, the parser immediately
switches the scanner to the "full scan" state and parses the rest of that
preprocessing statement. When it encounters the end of the statement, it
switches the scanner back to the "copy" state.

Hope this helps.

BTW, are you making this code available? I've been thinking that it would
be nice to have a general macro processing preprocessor that would be easier
to use than m4 <http://www.gnu.org/software/m4/>. Wondering if what you're
doing might be a start for such a thing?

-Bruce

j***@jptechnical.co.uk

2013-06-16 19:25:33 UTC

Thanks for the ideas so far. The grammar I've defined to date is:

*#rom* *ID*, *int8* { *int8*, *int8* }

*#rom* *ID*, *int16* { *int16*, *int16* }

*#rom* *ID*, "*Quoted text string*"

*#rom* *ID*, """
*Block text a*
*la Python*
"""

*#rom* (*locateLo*|*locateHi*)@*addr*

IDs are general identifiers so each rom block can be referred to by the rest
of the 'C' code. *int8*, *int16*, *locateLo* and *locateHi* are reserved
words. *int8*, *int16* and *addr* are general 'C' expressions.

I'll certainly make the source available but it's rather specific compared
to something like M4. What's the best way to do that? It's ~640 lines at
the moment.

Bruce Frederiksen

2013-06-16 20:21:34 UTC

It looks like all of the statements satisfy the first requirement (starts
with a non-C token), since #rom is not a normal C identifier.

All but the last satisfy the second requirement (can recognize the end of
the statement without looking ahead). As an example for the last
statement, after the parser sees:

#rom ***@A + B

it would need to peek at the next token to see if it is one that would
continue the expression (such as +, -, *, etc), or something else such as
the 'void' token for a following function definition:

#rom ***@A + B + C

vs.

#rom ***@A + B

void foo(int x) {
...
}

In the second case, the 'void' token would be scanned in the "full scan"
state, so would not be copied to the output. Having seen 'void', the
parser would return the scanner to the "copy" state, where it would pick up
with "foo"... and start copying from there, missing 'void'.

If it is possible to change the syntax, something like:

#rom locateLo(addr)

or

#rom ***@addr;

would correct this. Then the ) or ; would signal the end of the statement
with no need to look any further.
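
To make the lost-token problem concrete, here is a minimal sketch in plain Python. It uses the `re` module rather than PLY and is not the preprocessor discussed in this thread; the function name `preprocess`, the `terminated` flag and the simplified token rules are all assumptions made for illustration. In the unterminated mode the end of a statement is only noticed once the next operand has already been scanned, so that token never reaches the copied output.

```python
import re

# "Full scan" tokens: identifiers, decimal numbers, or any single
# non-space character. Leading whitespace is skipped.
TOKEN = re.compile(r'\s*([A-Za-z_]\w*|\d+|\S)')
CONTINUES = set('@+-*/')   # tokens after which the expression must go on


def preprocess(source, terminated):
    """Copy C text through and collect the tokens of each '#rom' statement.

    terminated=True:  a statement ends at ';' (no lookahead needed).
    terminated=False: the end is only detected by scanning one token too far.
    Returns (copied_text, list_of_token_lists).
    """
    copied, statements, pos = [], [], 0
    while pos < len(source):
        start = source.find('#rom', pos)
        if start < 0:                       # "copy" state until end of input
            copied.append(source[pos:])
            break
        copied.append(source[pos:start])    # "copy" state up to the directive
        pos = start + len('#rom')
        tokens = []                         # "full scan" state starts here
        while True:
            m = TOKEN.match(source, pos)
            if m is None:                   # nothing left to scan
                pos = len(source)
                break
            tok, pos = m.group(1), m.end()
            if terminated:
                if tok == ';':              # explicit end of statement
                    break
            elif tokens and tok not in CONTINUES and tokens[-1] not in CONTINUES:
                # An operand directly after an operand: 'tok' belongs to the
                # following C code, but it has already been consumed.
                break
            tokens.append(tok)
        statements.append(tokens)
    return ''.join(copied), statements
```

With `terminated=False`, the input `"#rom locateLo@A + B\nvoid foo(int x) {}\n"` yields copied text that starts at ` foo(...)`, with `void` missing. Adding the `;` and passing `terminated=True` keeps the whole function definition.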

As to generalizing this, I would think that most of the scanner tokens
would be the same for most languages: various literals, identifiers, and a
rule that returns any single special character as a token (though comments
are a bit different). Thus, from the scanner's perspective, it should be
relatively easy to use this as a preprocessor for Python, Ruby, C, Fortran,
or many other languages.

You want C expressions in your preprocessor statements. Full C expressions
include multi-special-character tokens such as ++ or +=. I don't know what
these would mean in your context, and whether you would want to include
them since they have side effects; but if so, you'd need special scanner
rules for them, alongside a single rule that returns any other single
special character as a token.
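
As a rough illustration of that rule ordering, the sketch below uses Python's `re` module directly instead of PLY. The operator list, the group names and the `tokenize` helper are assumptions chosen for the example. The point is only that the multi-character operators are tried, longest first, before the catch-all single-character rule.

```python
import re

# Multi-character operators must be tried before the single-character
# catch-all, longest first, so that '<<=' is not split into '<', '<', '='.
MULTI_CHAR_OPS = sorted(
    ['<<=', '>>=', '++', '--', '+=', '-=', '*=', '/=', '%=', '&=', '|=',
     '^=', '==', '!=', '<=', '>=', '&&', '||', '<<', '>>', '->'],
    key=len, reverse=True)

TOKEN_RE = re.compile(
    r'(?P<ID>[A-Za-z_]\w*)'
    r'|(?P<NUMBER>0[xX][0-9A-Fa-f]+|\d+)'
    r'|(?P<OP>' + '|'.join(re.escape(op) for op in MULTI_CHAR_OPS) + r')'
    r'|(?P<CHAR>\S)'            # any other single special character
)


def tokenize(text):
    """Return (kind, value) pairs; whitespace between tokens is skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]
```

For example, `tokenize("a += b++")` gives `('OP', '+=')` and `('OP', '++')` as single tokens, while `tokenize("{ x }")` returns the braces as `CHAR` tokens.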

Then the parser rules would be specific to your preprocessing directives.
If somebody wants to use this for another purpose, they would need to
define their own preprocessor statement syntax.

If you are familiar with a source code control system, such as
Mercurial <http://mercurial.selenic.com/>,
you can put your code on code.google.com for free. There are several other
free open source code hosting sites if you don't like Google. No rush on
any of this, if you just want to play with the idea for a bit to see if you
can get it working.

If you aren't familiar with a common source code control system, and don't
mind sending a copy to me, and don't mind if I put it up on Google Code, I
would appreciate it! Again, no rush...

Thanks!

-Bruce

A.T.Hofkamp

2013-06-17 06:59:02 UTC

Post by j***@jptechnical.co.uk
Thanks for ideas so far. The grammar I've defined to date is
*#rom* /ID/, *int8* { /int8/, /int8/ … }
*#rom* /ID/, *int16* { /int16/, /int16/ … }
*#rom* /ID/, "/Quoted text string/"
*#rom* /ID/, """
/Block text a/
/la Python/
"""

It looks like a 2 stage problem to me, where PLY could play a role.

First, perform a line-based search to separate the #rom sections from the other sections.
This can be fairly high level: you are not interested in the precise content at this stage, you are
just looking for the end of a #rom section.

Ideally you can do that with some simple string operations and/or re matching. If you have nested
delimiters (e.g. "{ { } }"), you need to do some counting.
More complicated options are inspecting a token stream (possibly with counting), or even a parse,
where most of the input would map to an "unimportant text" token. If you want to preserve
whitespace, also generate whitespace tokens.

At the end, you should get a list of #rom and #non-rom sections.

Now that you know what text is a #rom section, you can address parsing of separate #rom sections. I
suspect you'll need C expression parsing, to avoid counting "{ 1, f(2,3,4) }" as 4 values (which
happens if you do line.split(',')).
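
The two stages could be sketched roughly as follows, in plain Python. The helper names `split_sections` and `split_top_level` are made up for this example. It assumes that a #rom section starts on a line beginning with `#rom`, that any opening brace is on that same line, and that braces do not occur inside string literals. Triple-quoted block text is not handled.

```python
def split_sections(source):
    """Stage 1: split the source into ('c', text) and ('rom', text) sections.

    Line based, counting braces to find the end of a multi-line #rom section.
    """
    sections, c_lines, rom_lines, depth = [], [], [], 0
    for line in source.splitlines(keepends=True):
        if rom_lines or line.lstrip().startswith('#rom'):
            if not rom_lines and c_lines:       # flush the pending C text
                sections.append(('c', ''.join(c_lines)))
                c_lines = []
            rom_lines.append(line)
            depth += line.count('{') - line.count('}')
            if depth <= 0:                      # braces balanced: section ends
                sections.append(('rom', ''.join(rom_lines)))
                rom_lines, depth = [], 0
        else:
            c_lines.append(line)
    if rom_lines:                               # unterminated #rom section
        sections.append(('rom', ''.join(rom_lines)))
    if c_lines:
        sections.append(('c', ''.join(c_lines)))
    return sections


def split_top_level(text, sep=','):
    """Stage 2 helper: split on sep only outside (), [] and {}."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch in '([{':
            depth += 1
        elif ch in ')]}':
            depth -= 1
        if ch == sep and depth == 0:
            parts.append(''.join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append(''.join(current).strip())
    return parts
```

Here `split_top_level("1, f(2,3,4)")` returns two items, where `"1, f(2,3,4)".split(',')` would return four. Because the lines are kept with their line endings, joining the section texts reproduces the input exactly.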

Albert

j***@jptechnical.co.uk

2013-06-19 12:23:07 UTC

Thanks for all the replies and ideas, plenty to think about.

I've just found and read section 4.18, 'Conditional lexing and start
conditions', in 'ply.html'. This covers conditional lexing and states in
the lexer, so it is probably the way to go. My initial thought is to have one
state that matches a 'token' up to a #rom token (easy enough with a Python
re), pass all this other 'C' code to the parser as a single token and
switch the lexer to a #rom tokenising state. At the end of the #rom
statement a parser rule action will have to switch the lexer state back.
This approach has the advantage that all output goes through the parser and
all interaction with the lexer is at the published interface level.

I will definitely make the '#rom locateHi|Lo @ addr' statement explicitly
terminated. ';' is 'C' like and seems good to me. Hopefully this will stop
the parser having to read a lookahead token, as dangyogi pointed out.

Will do these mods but not sure when.

I don't know if it would be feasible to turn it into a general purpose
pre-processor as an m4 alternative but it could certainly be a customisable
framework.

Thanks again everyone.
John

j***@jptechnical.co.uk

2013-07-08 15:24:55 UTC

Made all the mods to the preprocessor so it handles free form text between
#rom statements. Went to put the code in code.google as dangyogi suggested.
In creating the project it asks which license to use. Suggestions as to the
best one to use would be appreciated, as I am new to open source publishing.
Many thanks,
John

Bruce Frederiksen

2013-07-08 18:37:34 UTC

I'm certainly not an expert on the various licenses. My limited
understanding is that if you don't care what anybody does with it,
including selling it as a commercial product, then BSD or MIT are two that
are short and sweet (I generally use BSD, but forget why).

If you want to limit its use only within other open source projects, then
GNU GPL or GNU Lesser GPL. These both require anybody redistributing the
code to keep it under the original license. The Lesser license is for
libraries to allow use of the libraries in other programs without
restrictions while retaining the limitations for the source code. I think
that the GNU licenses have some gotchas that aren't well understood when a
user wants to use your GPL project along with some other non-GPL project,
so might scare some people off. I don't believe that GPL prevents somebody
from selling your project commercially, so long as they offer the sources
to anybody who asks for them for free under the same license. Not sure how
anybody could make money this way, except maybe by selling support?

The Apache license is another popular one. Not sure what its limitations
are.

Sorry I couldn't be more helpful. :-/

-Bruce

j***@jptechnical.co.uk

2013-07-09 23:50:29 UTC

The preprocessor is good enough to use now and I've annotated it with
plenty of comments. Source is available at
http://code.google.com/p/ccs-preprocessor/. Usage and source file format
comments are extracted and put in the wiki.

I used an extra, exclusive, state, 'ccode', in the lexer to handle free
form text between #rom statements. To switch between lexer states 'ccode'
and 'INITIAL' (which handles the #rom statements), empty productions are
used immediately before the terminating token of each state. The action of
these productions is to switch lexer states. That way the lexer state is
changed as soon as the terminating token of a state has been taken by the
parser.
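
The timing of that switch can be illustrated without PLY. The sketch below is a hand-written stand-in, not the ccs-preprocessor source. The `Lexer` class only mimics the `begin()` and `token()` methods of a PLY lexer, and `preprocess` plays the role of a parser with one token of lookahead. The token rules, the ';' terminator and the placeholder comment written to the output are assumptions. The state is changed while the terminating token is still the lookahead, so the next token fetched is already scanned in the new state.

```python
import re


class Lexer:
    """Minimal two-state lexer; begin() and token() mimic a PLY lexer."""

    RULES = {
        # Free form C text: everything up to the next '#rom' is one token.
        'ccode': [('ROM', re.compile(r'#rom\b')),
                  ('CCODE', re.compile(r'(?:(?!#rom\b).)+', re.S))],
        # Tokens inside a #rom statement.
        'INITIAL': [('ID', re.compile(r'[A-Za-z_]\w*')),
                    ('NUMBER', re.compile(r'0[xX][0-9A-Fa-f]+|\d+')),
                    ('CHAR', re.compile(r'\S'))],
    }

    def __init__(self, text, state='ccode'):
        self.text, self.pos, self.state = text, 0, state

    def begin(self, state):
        self.state = state

    def token(self):
        if self.state == 'INITIAL':         # whitespace is ignored here only
            while self.pos < len(self.text) and self.text[self.pos].isspace():
                self.pos += 1
        for kind, pattern in self.RULES[self.state]:
            m = pattern.match(self.text, self.pos)
            if m:
                self.pos = m.end()
                return kind, m.group()
        return None                         # end of input


def preprocess(text):
    lexer = Lexer(text)
    look = lexer.token()                    # one token of lookahead

    def advance():
        nonlocal look
        tok, look = look, lexer.token()
        return tok

    out, statements = [], []
    while look is not None:
        if look[0] == 'CCODE':
            out.append(advance()[1])        # copy C text unchanged
            continue
        # 'look' is ROM, the terminating token of the 'ccode' state.
        lexer.begin('INITIAL')              # switch before fetching past it
        advance()
        body = []
        while look is not None and look[1] != ';':
            body.append(advance()[1])
        lexer.begin('ccode')                # ';' is the lookahead: switch now
        advance()
        statements.append(body)
        out.append('/* #rom statement processed */')   # placeholder output
    return ''.join(out), statements
```

If `lexer.begin()` were called after `advance()` instead, the token following `#rom` would already have been scanned as C text and the whole statement would be copied through as a `CCODE` token.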

Interestingly, the code to do the command line decoding and #rom statement
processing (the *RomHandler* class) is relatively large compared to the
lexer and parser code. Considering what the parser is doing, that seems to
me a credit to lex/yacc, PLY and of course Python.

I plan to convert the module into a more Pythonic OO framework suitable for
building such preprocessors more quickly than starting from scratch. But
that will have to be a couple of months away.

Comments invited,
John
