Discussion:
Python re's 100-group limit slows down PLY performance
viy
2012-08-17 12:52:09 UTC
Permalink
Hi all, jfyi

I added just one more token to my lexer rules and hit the 100-group limit
in Python's re module:
http://stackoverflow.com/questions/478458/python-regular-expressions-with-more-than-100-groups

PLY has a workaround in its code: when your master regex exceeds 100 groups,
PLY catches the AssertionError raised by Python, splits the master regex into
parts, and retries.
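
For anyone unfamiliar with the mechanism, here is a minimal sketch of the
split-and-retry idea -- this is not PLY's actual code, just an illustration,
and the rule tuples are hypothetical:

    import re

    def build_master_regexes(rules, flags=re.VERBOSE):
        """Combine (name, pattern) lexer rules into as few compiled master
        regexes as possible, starting a new one whenever the 100-group
        limit is exceeded.  Sketch only, not PLY's implementation."""
        def compile_chunk(chunk):
            return re.compile(
                "|".join("(?P<%s>%s)" % (n, p) for n, p in chunk), flags)

        masters, chunk = [], []
        for name, pattern in rules:
            chunk.append((name, pattern))
            try:
                compile_chunk(chunk)
            except AssertionError:
                # Too many groups: close the current master regex and start
                # a new one with the rule that pushed us over the limit.
                chunk.pop()
                if chunk:
                    masters.append(compile_chunk(chunk))
                chunk = [(name, pattern)]
        if chunk:
            masters.append(compile_chunk(chunk))
        return masters

The cost of splitting is presumably that lexing must now try each master
regex in turn for every token, which would explain the slowdown below.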

Everything works, but in my case the unit test suite became 10x slower.
A single parse is about 1.5x slower.

The obvious solution is to get rid of the Python limitation.
Does anyone know the best way to do so?

Thanks in advance.
--
Best regards,
Eugene Voytitsky
A.T.Hofkamp
2012-08-20 15:18:40 UTC
Permalink
Post by viy
The obvious solution is to get rid of the Python limitation.
Does anyone know the best way to do so?
Re-implement RE? :D
Much happiness would spread throughout the Python community, I am sure :)


Other solutions include

0. Live with it. Other solutions may cost more time than you are ever going to save.

1. DIY: You can define your own scanner in arbitrary Python code; just make sure you
match the lexer interface. String scanning is relatively easy, it just takes a lot of code
(see the sketch after this list).

2. A long time ago (several years at least), someone wrote a Lex framework. I forget the
details, but the mailing list archive or Google can probably help you. IIRC it was a true lex and
took a different approach than using re.

3. More exotic solutions like writing a scanner C extension (generated with lex/flex) are also possible.

4. Even more exotic stuff, like generating a DFA somehow and implementing that in Python, can also be done.

5. Other Python parser generators may have better solutions (I somewhat doubt it, but it should be
easy enough to scan through them, checking how their scanners work).
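
For option 1, a minimal sketch of what a hand-written scanner could look like. As far as
the parser is concerned, it only needs an object with input() and token() methods whose
tokens carry type, value, lineno and lexpos; the token names and the toy rules below are
made up for illustration:

    class Token(object):
        """Minimal stand-in for a PLY LexToken."""
        def __init__(self, type, value, lineno, lexpos):
            self.type, self.value = type, value
            self.lineno, self.lexpos = lineno, lexpos
        def __repr__(self):
            return "Token(%s, %r)" % (self.type, self.value)

    class HandWrittenLexer(object):
        """Hand-rolled scanner exposing the interface the parser expects."""
        def input(self, data):
            self.data, self.pos, self.lineno = data, 0, 1

        def token(self):
            data, n = self.data, len(self.data)
            # Skip whitespace, tracking line numbers.
            while self.pos < n and data[self.pos] in " \t\n":
                if data[self.pos] == "\n":
                    self.lineno += 1
                self.pos += 1
            if self.pos >= n:
                return None                      # end of input
            start, ch = self.pos, data[self.pos]
            if ch.isdigit():
                while self.pos < n and data[self.pos].isdigit():
                    self.pos += 1
                return Token("NUMBER", int(data[start:self.pos]),
                             self.lineno, start)
            if ch.isalpha() or ch == "_":
                while self.pos < n and (data[self.pos].isalnum() or
                                        data[self.pos] == "_"):
                    self.pos += 1
                return Token("IDENT", data[start:self.pos],
                             self.lineno, start)
            # Anything else: single-character token, typed by the character.
            self.pos += 1
            return Token(ch, ch, self.lineno, start)

You would then hand it to the parser with something like
parser.parse(text, lexer=HandWrittenLexer()).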


Good luck

Albert
Eugene Voytitsky
2012-08-20 15:45:09 UTC
Permalink
Thanks, Albert.

It seems that I really have only two options:
0. Live with it, or
1. Recompile Python without the check for that limitation. (And
you are right – I should ask the Python community. But I decided to share
the info here first.)

The other options don't suit, because a lot of work has already been done and
the code is already in production.
--
Best regards,
Eugene Voytitsky
Joseph S. Tate
2012-08-20 16:00:53 UTC
Permalink
Have you tried either of the non-stdlib re replacements? I note there
are at least two: re2 and regex. It seems like a monkey-patching wrapper
around either of those would be easier to distribute than a patched
version of Python.

Besides the 100-group limit, re also has limits on total regex length,
depending on the compiled character width.
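
A rough sketch of what such a monkey patch could look like, assuming the
third-party regex module is installed and that ply.lex uses the module-level
`re` it imported (worth verifying against your PLY version; the rules module
name is hypothetical):

    import regex            # third-party re replacement: pip install regex
    import ply.lex as lex

    # Point ply.lex at the regex module before any lexer is built, so the
    # master patterns are compiled without the stdlib 100-group limit.
    lex.re = regex

    import my_token_rules   # hypothetical module holding tokens and t_* rules
    lexer = lex.lex(module=my_token_rules)

A blunter alternative would be to install regex as sys.modules['re'] before
PLY is first imported, but that affects every other user of re in the process.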
--
Joseph Tate
Personal e-mail: jtate AT dragonstrider DOT com
Web: http://www.dragonstrider.com
David Beazley
2012-08-20 16:06:20 UTC
Permalink
Having just looked at the Python source, the 100-group limit is literally just hard-coded into Modules/_sre.c and a few other places. Without looking at it more closely, it is hard to imagine a reason why this is needed. Plus, if you're going to hard-code a limit, surely it could be bumped up to something substantially larger like 1000 or even 10000 on a modern machine (assuming the limit was there for memory).
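
The limit is easy to see directly on the interpreters of that era (later
releases lifted it, so on a modern Python the same snippet compiles cleanly):

    import re

    # One more capturing group than the hard-coded limit.
    pattern = "|".join("(?P<T%d>x%d)" % (i, i) for i in range(101))

    try:
        re.compile(pattern)
        print("compiled fine - this interpreter has no 100-group limit")
    except (AssertionError, re.error) as exc:
        # Older CPython raises AssertionError out of sre_compile:
        # "sorry, but this version only supports 100 named groups"
        print("hit the group limit: %s" % exc)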

Wonder if this could be pushed for Python 3.3 (still in beta).

Cheers,
Dave