a Boolean query parser in Python
Last summer I worked on a project where I needed a search engine that supported simple Boolean queries and key-value parameters. This was to be used on a Python site, so I wanted to write the query parser in Python too. At the time, Lepl seemed like a good choice. I figured out the grammar necessary to support the search query language I wanted, and all was good. I went to check out Lepl just now and I see it has been discontinued, which is a little disheartening. Regardless, I figured better late than never to post this since it still works and Lepl did the job.
The goal was to be able to take a query like “a AND b OR c” and interpret that as “both a and b, or c”, with the AND binding tighter than the OR. Key-value parameters separated by a colon also needed to be supported. Strings in quotes also needed to be kept together. So a query like ‘cat:5 AND tag:”fun stuff” OR id:4 OR is_posted’ would mean I was searching for data such that:
- The
cat
property was 5 and thetag
property was “fun stuff”; OR - The
id
property was 4; OR - It had the
is_posted
property
This is for some hypothetical set of data, maybe pertaining to a blog. Here’s the grammar I came up with:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 |
from lepl import * # AND one or two elements in a list def ander(result): # Two elements in the list => a tuple if len(result) == 2: return (result[0], result[1]) return result[0] # Text is either a string or a word # Use Lepl's built-in String and Word # http://www.acooke.org/lepl/api/lepl.matchers.derived-module.html#String # http://www.acooke.org/lepl/api/lepl.matchers.derived-module.html#Word text = String() | Word() # An AND clause can contain another AND clause, so delay the definition of this andClausePrime = Delayed() # The key in a key:value parameter, excluding the colon label = text & Drop(':') # Skip extra spaces with DroppedSpace(): # Grab the key and value in a key:value parameter (which consists of a label, defined above, followed by some text, defined above) and stuff them into a dictionary such that key => value parameter = label & text > (lambda r: {r[0]: r[1]}) # An AND clause consists of either a key:value parameter or some text, followed by that tricky andClausePrime that we haven't defined; stuff all this into the ander function defined above and let it spit out the result andClause = (parameter | text) & andClausePrime > ander # Now we can define the delayed andClausePrime since we know what an andClause is. We expect the key word 'AND', which we discard as soon as we see it, then either an andClause or a key:value parameter or some text, followed by this whole thing again. The [:] is greedy, meaning we expect zero or more of these, and take as many as possible. andClausePrime += (Drop('AND') & (andClause | parameter | text) & andClausePrime)[:] # An expression is an AND clause, a key:value parameter, or some text expr = andClause | parameter | text # This sums up all the possible queries we will allow. We get one or more expressions, OR'd together. query = expr & (Drop('OR') & expr)[:] |
Here is some test input and output:
- a AND b
[('a', 'b')]
- a AND b AND c
[('a', ('b', 'c'))]
- a AND b AND c OR e AND f
[('a', ('b', 'c')), ('e', 'f')]
- a AND b AND c OR e OR f
[('a', ('b', 'c')), 'e', 'f']
- “bar none” OR foo
['bar none', 'foo']
- cat:5 AND tag:”fun stuff” OR id:4 OR is_posted
[({'cat': '5'}, {'tag': 'fun stuff'}), {'id': '4'}, 'is_posted']
- key:value AND “hey now”:5 OR “what is”:up
[({'key': 'value'}, {'hey now': '5'}), {'what is': 'up'}]
You can see we end up with ANDed terms in a tuple and ORed terms in a list. Key-value parameters are in dictionaries.
Hopefully this code will be of use to someone. You can also check out my original question on Stack Overflow to see my process in getting to the grammar above.