Power Matching: Using Regular Expressions

When I started working with Apache JMeter, the documentation kept referencing something called Regular Expressions. In fact, I knew so little about this space that when a friend referred to Regular Expressions as “RegEx,” I wondered what he was talking about. Slowly I taught them to myself with the help of Wikipedia and a friend of mine. In this article, I will walk you through the basics of regular expressions, most commonly known as RegEx, which play a pivotal role in handling dynamic data in JMeter.

What are Regular Expressions, anyway?

A Regular Expression is a sequence of symbols and characters that describes a string or pattern to be searched for within a longer piece of text. Regular Expressions are about “power matching.” Let’s say you want to match the set containing the three strings “Handel”, “Händel”, and “Haendel”. That set can be specified by the pattern H(ä|ae?)ndel; we say that this pattern matches each of the three strings. That is power matching using RegEx.
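If you want to try the Handel pattern yourself, here is a minimal sketch. It uses Python's re module, which is my own choice for illustration and not what JMeter runs internally, so treat it as a quick sandbox. The extra name "Hendel" is a made-up negative case.

```python
import re

# Pattern from the text: "H", then either "ä" or "a" with an optional "e", then "ndel".
pattern = re.compile(r"H(ä|ae?)ndel")

# "Hendel" is a hypothetical extra name added to show a non-match.
names = ["Handel", "Händel", "Haendel", "Hendel"]

# fullmatch() succeeds only when the whole string fits the pattern.
matched = [name for name in names if pattern.fullmatch(name)]
print(matched)
```

Only the three spellings from the text should survive the filter.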

How do I learn about RegEx?

Ultimately, understanding and writing Regular Expressions (RegEx) is a little bit like getting your first job. You can’t get hired without experience, and you can’t get experience without getting hired. With RegEx, you don’t really understand them until you use them, and you can’t really use them until you understand them. So you have to learn a little bit, and then use a little bit and get them wrong, and then go back to the book and learn a little bit more. The other problem you will have with RegEx is that each character is easy. Put them all together and you get this:

/\?cid=[0-9]{3,3}

And that one wasn’t very hard. The more you work with them, the easier they’ll get. So master each step, put a couple together, make some mistakes and get going. Soon you’ll be a RegEx pro.

The Backslash (\)

I always encourage people to start their “RegEx career” by learning the characters, and the best one to start with is the backslash. A backslash is different from all the other characters, as you will see. It provides a bridge between Regular Expressions and plain text. A backslash “escapes” a character. What does “escape” mean? It means that it turns a Regular Expression character into plain text. If that doesn’t make sense to you yet, hold on – I have a few examples coming.

Perhaps I have a goal to match “/folder?pid=123” in the response body. The problem is that the question mark already has another use in Regular Expressions, but we need it to be an ordinary question mark. (We need it to be plain text.) We can do it like this:

/folder\?pid=123

Notice how there is a backslash in front of the question mark – it turns it into a plain question mark.

Backslash can be used to turn special RegEx characters into everyday, plain characters.
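To see what the backslash buys you, here is a small sketch comparing the escaped and unescaped versions of the pattern. It assumes Python's re module as a stand-in engine, and the sample response line is invented for the example.

```python
import re

# Hypothetical response text containing the target string.
body = "GET /folder?pid=123 HTTP/1.1"

# Escaped: the backslash makes "?" a literal question mark.
escaped = re.search(r"/folder\?pid=123", body)

# Unescaped: "?" makes the preceding "r" optional, so the pattern looks for
# "/folde" or "/folder" followed directly by "pid=123", which is not in the text.
unescaped = re.search(r"/folder?pid=123", body)

print(escaped.group(0) if escaped else None)
print(unescaped)
```

The escaped pattern finds the string, while the unescaped one finds nothing because no literal question mark is ever matched.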

The Pipe (|)

The pipe is the simplest of Regular Expressions, and it is the one that all Regular People (that’s you and me) should learn. It means ‘or’. Notice that the pipe matches everything on either side of it.

Here’s a simple example: Let’s say you have a page response and you want to match all the occurrences of the words Coke and Pepsi. You could create your expression like this:

Coke|Pepsi
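A quick way to check this one is to collect every hit from a piece of text. The sketch below assumes Python's re module (my choice for trying things out, not JMeter's engine) and a made-up page response.

```python
import re

# Hypothetical page text for illustration.
page = "Coke is on sale. Pepsi is not. I still prefer Coke."

# findall() returns every non-overlapping match, in order of appearance.
drinks = re.findall(r"Coke|Pepsi", page)
print(drinks)
```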

The Question Mark (?)

A question mark means, “The last item (which, for now, we’ll assume is the last character) is optional.” In other words, the question mark makes the preceding token in the regular expression optional.

Imagine we have a page containing both spellings, “colour” and “color”, and our goal is to match every occurrence of either one. You could create an expression like this:

colou?r: the question mark after the 'u' makes that letter optional, so 
the expression accepts zero or one occurrence of 'u' at that position.

With the above expression, you will be able to match both ‘color’ and ‘colour’ on the page.
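Here is a small sketch of the optional 'u' in action, again assuming Python's re module as a convenient test bench. The sample sentence, including the misspelling "colouur", is invented to show that the question mark allows at most one 'u'.

```python
import re

# Hypothetical page text; "colouur" is a deliberate typo with two u's.
page = "The colour red and the color blue; colouur is a typo."

# "u?" accepts zero or one "u", so the double-u typo is not matched.
spellings = re.findall(r"colou?r", page)
print(spellings)
```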

Parentheses ()

Parentheses in Regular Expressions work the same way that they do in mathematics. This falls into the category of “Things I should have learned had I been paying attention in grade school.” Let’s understand this with an example:

/folder(one|two)/thanks

This matches two URLs, folderone/thanks and foldertwo/thanks. OK, on with the explanation. Remember, we were talking about things we should have learned had we been paying attention in school. Remember how your math teacher said that if you had an equation, the division and multiplication got done before the subtraction and addition? Well, since I wasn’t paying attention in Mrs. Rani’s 4th-grade class, I pulled out my old notes, and here is what I found:

2 + 3 x 5 = 17 (Right? 3 times 5 equals 15, plus 2 equals 17.)

If you wanted it to execute differently, you had to force the equation with parentheses – like this:

(2 + 3) x 5 = 25

Above, I’ve changed the same numbers to become 2 plus 3 equals 5, times 5 equals 25. And that’s the value of parentheses in math. I see from my very old notes that Mrs. Rani called it the Order of Operations.

So what about Regular Expressions? Why would we need parentheses there? In order to understand our need, we have to look at other expressions (just like we had to understand the math operations symbols in order to understand why parentheses are needed.)

Let’s use pipes as our example. I wrote that the expression coke|pepsi means everything on one side of the pipe (coke) or everything on the other (pepsi).

But as we start to think about why we would want to use parentheses, we can revisit that example above and ask ourselves, “What happens when we don’t want to grab everything on either side of the pipe?” Like this example:

/foldertwo/thanks
/folderone/thanks

A great way to represent this in RegEx would be:

/folder(one|two)/thanks

So we are allowing the RegEx to match either the thanks page in folderone or the thanks page in foldertwo – and it is the parentheses that allow us to group so that the pipe knows what to choose between.

This next example is a little different. Again, we’re going to roll two URLs into one goal, but this time, we use the parentheses to tell the question mark what is optional. This website has three thank-you pages:

/thanks
/thankyou
/thanksalot

If we only want the /thanks and the /thanksalot pages to be part of our goal, we could do it like this:

/thanks(alot)?

This means the target string must include /thanks, but alot is optional. So it matches both /thanks and /thanksalot. And the /thankyou page will never get included, because there is no s in its URL (so it doesn’t match the beginning of this RegEx, i.e. /thanks).
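Both uses of parentheses can be checked with a short sketch. I am assuming Python's re module here, and /folderthree/thanks is a made-up URL added as a negative case. The ungrouped pattern is included to show what the pipe does when nothing limits its reach.

```python
import re

# Grouping for the pipe: only "one" or "two" may vary.
urls = ["/folderone/thanks", "/foldertwo/thanks", "/folderthree/thanks"]
grouped = [u for u in urls if re.fullmatch(r"/folder(one|two)/thanks", u)]

# Without parentheses the pipe splits the whole expression into
# "/folderone" OR "two/thanks", so this full URL no longer matches.
ungrouped = re.fullmatch(r"/folderone|two/thanks", "/foldertwo/thanks")

# Grouping for the question mark: the whole word "alot" is optional.
pages = ["/thanks", "/thankyou", "/thanksalot"]
optional_group = [p for p in pages if re.fullmatch(r"/thanks(alot)?", p)]

print(grouped, ungrouped, optional_group)
```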

Square Brackets & Dashes

With square brackets, you can make a simple list, like this: [aiu] . This is a list of items and includes three vowels only. Note: Unless we use other expressions to make this more complicated, only one of them will be matched at a time.

So p[aiu]n will match pan, pin and pun. But it will not match pain, because that would require us to use two items from the [aiu] list, and that is not allowed in this simple example. You can also use a dash to create a list of items, like this:

[a-z] – all lower-case letters in the English alphabet
[A-Z] – all upper-case letters in the English alphabet
[a-zA-Z0-9] – all lower-case and upper-case letters, and digits

Dashes are one way of creating a list of items quickly, as you can see above. (Notice they are not separated by commas.)

Characters that are usually special, like $ and ?, are no longer special inside of square brackets. The exceptions are the dash, the caret (more on this one later) and the backslash, which still works inside the square brackets the way backslashes do everywhere else.

Here is an example of how you might use square brackets by themselves. Let’s say you have a product group, sneakers, and each product name has a number appended to it in the URL (we see this a lot with industrial products where they don’t have zippy names). So you might have sneakers450, sneakers101, etc. Now you could write an Expression that matches the products using square brackets and dashes.

sneakers[0-9]: this expression matches the name 'sneakers' followed by 
a single digit between 0 and 9, so it is found in sneakers450, sneakers101, etc.
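The bracket examples can be tried out as follows. This is only a sketch using Python's re module as a stand-in engine. The words "pen", "sneakersXL" and "boots200" are hypothetical extras that should not match.

```python
import re

# Exactly one character from the list [aiu] is allowed between "p" and "n".
words = ["pan", "pin", "pun", "pain", "pen"]
one_vowel = [w for w in words if re.fullmatch(r"p[aiu]n", w)]

# search() looks for "sneakers" followed by one digit anywhere in the name.
products = ["sneakers450", "sneakers101", "sneakersXL", "boots200"]
numbered = [p for p in products if re.search(r"sneakers[0-9]", p)]

print(one_vowel, numbered)
```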

Braces {}

Braces repeat the last “piece” of information a specific number of times. They can be used with two numbers, like this: {1,3}, or with one number, like this: {3}. When there are two numbers in the braces, such as {x,y}, it means, repeat the last “item” at least x times and no more than y times. When there is only one number in the braces, such as {z}, it means, repeat the last item exactly z times.

Here is my example: Lots of companies want to take all visits from their IP address out of their analytics. Many of those same companies have more than one IP address – they often have a block of numbers. So let’s say that their IP addresses go from 123.105.169.0 through 123.105.169.99 – how would we capture that range with braces? Our regular expression would be:

123\.105\.169\.[0-9]{1,2}

Notice that we actually used four different RegEx characters: We used a backslash to turn the magic dot into an everyday dot, we used brackets as well as dashes to define the set of allowable choices, i.e. the last “item”, and we used braces to determine how many digits could be in the final part of the IP address.

On the other hand, if there is only one number in the braces, the match will only work if you have exactly the right number of characters.

Here is my example: Let’s say a page displays 10-digit mobile numbers and our goal is to match all of them using RegEx. You could write a RegEx as below:

\d{10}: \d (more on this later) matches a single digit.
This RegEx matches exactly 10 digits, as defined in the braces.
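Both brace forms can be tested with a short sketch, assuming Python's re module as the playground. The addresses outside the block and the sample sentence are invented. Note that I use fullmatch() for the IP check: with a plain search, the pattern would also be found inside 123.105.169.100, because its first two final digits fit.

```python
import re

# {1,2}: the final part of the address may have one or two digits.
ip_pattern = re.compile(r"123\.105\.169\.[0-9]{1,2}")
ips = ["123.105.169.0", "123.105.169.99", "123.105.169.100", "123.105.170.5"]
in_block = [ip for ip in ips if ip_pattern.fullmatch(ip)]

# {10}: exactly ten digits in a row; the five-digit number is skipped.
text = "Call 9876543210 or 12345 for help."
mobiles = re.findall(r"\d{10}", text)

print(in_block, mobiles)
```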

The Dot (.)

A dot matches any one character: a digit, a letter or a special character. A dot even matches whitespace (though in most RegEx engines it does not match a line break by default). I also learned there aren’t that many uses for dots by themselves, but they are very powerful when combined with other RegEx characters. Let me start with some examples of how the dot can be used alone. Take this Regular Expression:

.ite

It would match site, lite, bite, kite. It would also match %ite and #ite (because % and # are characters, too.) However, it wouldn’t match ite. Why not? A dot matches one character, and ite includes zero characters for the dot to match (i.e., it didn’t match any).
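Here is the dot on its own, as a sketch in Python's re module (my assumption for a test environment). The candidate " ite", with a leading space, is an extra case I added to show that whitespace counts as a character too.

```python
import re

# The dot must consume exactly one character before "ite".
candidates = ["site", "kite", "%ite", " ite", "ite"]
dot_matches = [c for c in candidates if re.fullmatch(r".ite", c)]
print(dot_matches)
```

Everything except the bare "ite" passes, because only that one leaves the dot with nothing to match.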

The Plus Sign (+)

A plus sign matches one or more of the former items, which, as usual, we’ll assume is the previous character. (It can be more complicated, but let’s start easy.) So the list of possible matches is clear: the former character. And the number of matches is clear: one or more.

Here’s an example from the world of literature: When a character trips and falls on his face, he often says Aaargh! Or maybe it’s Aaaargh! or just Aargh! In any case, you could use a plus sign to match the target string, like this: ‘aa+rgh’. That will match aargh and aaargh and aaaaaaaaargh. Well, you understand. Notice, however, that it won’t match argh. Remember, it is one or more of the former items.

The Star (*)

People really misuse stars. They have specific meanings in other applications. They don’t mean the same thing in RegEx as they do in some of those other applications, so be careful here.

Stars will match zero or more of the previous items. They function just like plus signs, except they allow you to match ZERO (or more) of the previous items, whereas plus signs require at least one match. For the time being, let’s just define “previous item” as “previous character.”

Since stars are so much like plus signs, I’ll start with the same example and point out the differences. So once again, when a character trips and falls on his face, he often says Aaargh! Or maybe it is Aaaargh! or (unlike last time) just Argh! In any case, you could use a star to match the target string, like this: aa*rgh. That will match aargh and aaargh and aaaaaaaaargh – and the big difference from the plus sign is that it will also match argh (i.e. no extra “a’s” added).
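The difference between the plus sign and the star is easiest to see side by side. This sketch assumes Python's re module. The string "rgh" is a made-up case with no "a" at all, which neither pattern accepts because both start with a required "a".

```python
import re

cries = ["argh", "aargh", "aaargh", "aaaaaaaaargh", "rgh"]

# "aa+rgh": one "a", then ONE or more further "a", then "rgh".
plus_matches = [c for c in cries if re.fullmatch(r"aa+rgh", c)]

# "aa*rgh": one "a", then ZERO or more further "a", then "rgh".
star_matches = [c for c in cries if re.fullmatch(r"aa*rgh", c)]

print(plus_matches)
print(star_matches)
```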

The Dot Star (.*)

There are two Regular Expressions that, when put together, mean “get everything.” They are a dot followed by a star, like this:

/folderone/.*index\.php

In this example, our Regular Expression will match everything that starts with /folderone/ and ends with index.php. This means if you have pages in the /folderone directory that end with .html, they won’t be a match to the above RegEx.

Now that you have an example, you might be interested in why this works. A dot, you may remember, means get any character. A star means repeat the last item zero or more times. This means that the dot could match any letter in the alphabet, any digit, any symbol on your keyboard. And the star right after it lets the dot keep matching one character after another (because it is zero or MORE), so it ends up matching everything. Hard to wrap your head around this one? Trust me, it works.
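If you would rather not just trust me, here is a sketch that puts the dot star to work. It assumes Python's re module, and the four paths are hypothetical.

```python
import re

paths = [
    "/folderone/index.php",
    "/folderone/shop/cart/index.php",
    "/folderone/about.html",
    "/foldertwo/index.php",
]

# ".*" swallows whatever sits between "/folderone/" and "index.php",
# including nothing at all (zero characters).
dot_star = [p for p in paths if re.fullmatch(r"/folderone/.*index\.php", p)]
print(dot_star)
```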

The Caret (^)

When you use a caret in your Regular Expression, you force the Expression to match only strings that start exactly the same way your RegEx does. Let’s understand this with a simple example: My website page contains strings that start with a number, such as:

123RTGS
114FTUR
143MISS

Our goal is to match only the digits at the start of each string. You could write a RegEx that serves our purpose:

(^\d+): ^ indicates beginning of the string.
        \d denotes matching a single digit.
        +  denotes matching of one or more digits.
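Here is how that expression behaves on the sample strings, sketched with Python's re module (an assumption on my part; any engine with the same basic syntax should do). "RTGS123" is an invented string whose digits sit at the end, so the caret rules it out.

```python
import re

codes = ["123RTGS", "114FTUR", "143MISS", "RTGS123"]

leading_digits = []
for code in codes:
    # The caret pins the match to the start; group(1) is the text inside the parentheses.
    m = re.search(r"(^\d+)", code)
    if m:
        leading_digits.append(m.group(1))

print(leading_digits)
```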

The caret can also be used in a different way, inside square brackets: [^...] matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than “a”, “b”, or “c”. [^a-z] matches any single character that is not a lowercase letter from “a” to “z”. Literal characters and ranges can be mixed here as well.

When you put a caret inside square brackets at the very beginning, it means match only characters that are not listed after the caret. So [^0-9] matches any single character that is not a digit; a target string made up entirely of digits is not a match.
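A short sketch of the caret inside square brackets, assuming Python's re module and two made-up input strings:

```python
import re

# Every character that is NOT a digit is picked out one at a time.
non_digits = re.findall(r"[^0-9]", "a1b22c")

# A string with nothing but digits gives [^0-9] nothing to match.
only_digits = re.search(r"[^0-9]", "12345")

print(non_digits, only_digits)
```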

The Dollar Sign ($)

A dollar sign means don’t match if the target string has any characters beyond where I have placed the dollar sign in my Regular Expression. Let’s understand this with an example. I have a webpage that has strings like: *.txt, *.txtPhp, *.php, *.html. Our goal is to match only the strings that end with .txt. You could write the below expression to achieve your goal:

.*\.txt$
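To check the dollar sign, here is a sketch with Python's re module standing in as the engine. The file names are hypothetical stand-ins for the *.txt, *.txtPhp, *.php and *.html strings.

```python
import re

files = ["notes.txt", "notes.txtPhp", "index.php", "page.html"]

# "$" demands that the string ends right after ".txt".
txt_files = [f for f in files if re.search(r".*\.txt$", f)]
print(txt_files)
```

Without the dollar sign, notes.txtPhp would also be accepted, since it contains ".txt".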

Whitespace (\s)

\s matches a whitespace character. In ASCII these are tab, line feed, form feed, carriage return, and space; in Unicode, it also matches no-break spaces, next line, and the variable-width spaces (amongst others).

Digits (\d)

\d matches a single digit, the same as [0-9] in ASCII. Combined with other RegEx characters, as in \d+, it will match one or more digits in a line or string.
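Both shorthands can be seen at work in one small sketch. It assumes Python's re module, and the sample line (with a tab in the middle) is invented.

```python
import re

# Hypothetical line containing spaces and one tab character.
line = "Order 66 shipped\ton 2024-01-15"

# \s+ splits on any run of whitespace, whether spaces or tabs.
tokens = re.split(r"\s+", line)

# \d+ grabs each run of one or more digits.
numbers = re.findall(r"\d+", line)

print(tokens)
print(numbers)
```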

This is pretty much all we have on Regular Expressions right now. I will keep updating the article with the latest testing information, so stay tuned; there’s a lot more to come.

🙂 Happy Performance Testing !!  🙂

 


One thought on “Power Matching: Using Regular Expressions”

  1. Hi
    I am stuck in the middle of Token and Authtoken. My load test scenario is that I first need to log in to the system. At that time a token is generated. Let’s say: {"ModuleCode":"Widget","SubModuleCode":"CompanyAnnouncement","ActionCode":"","ReasonCode":"","SubscriptionName":"hdfc","Token":"1e1ce692-63b4-411f-afc6-722ccd5d7e4e","TokenValue":"hdfc"}.

    Then there is a panel in the system; when I click on it, it redirects me to a new tab with a new token. Let’s say: {"Token":"9bc2fd50-d022-4124-9975-50a68f9e9dff","TokenValue":"hdfc","CategoryID":"2","Module":"EmployeeGeneralInfo","Method":"TimeStampDifference","Program":"","TransType":""}

    Again, I need to go back to the previous tab to log out.

    How can I correlate these 2 tokens?

    Thanks in advance.
