Regular Expressions with Greedy Sed

Regular expressions are scary to many. I overcame that fear when I ended up having to use them in my PHP scripting. Regular expressions are complex, but that complexity is exactly what makes them so flexible.

I found an old XML file on my drive that was 100 KB in size and consisted of a single line of text, which was horrible to work with. However, I managed to use *sed* to insert newlines where they should be and processed it that way. That still wasn't enough: I had to create a sed regular expression that would extract two pieces of data from a single line. Consider the following example:

<Element attrib1="prop1" attrib2="prop2" attrib3="prop3" attrib4="prop4"><Child><Child2 attrib5="EXTRACT1"/></Child><Sibling><Child3 attrib5="prop5"/></Sibling><Sibling2 attrib6="prop6">EXTRACT2</Sibling2></Element>

(Note that this is just an example, and not the actual XML data in the file)

What I wanted to do was extract the value EXTRACT1 together with the text inside the last child element, EXTRACT2. The final sed command I used was this:

sed -e 's#^.*attrib5="\([^"]*\)".*attrib5[^>]*><[^>]*><[^>]*>\([^<]*\)<.*$#\1 - \2#g' file

Here is what all that means. To begin, here are the matches and their corresponding regular expression parts in a table:

Regular Expression   Matched Text
^.*                  <Element attrib1="prop1" attrib2="prop2" attrib3="prop3" attrib4="prop4"><Child><Child2
attrib5="            attrib5="
\([^"]*\)            EXTRACT1
".*attrib5           "/></Child><Sibling><Child3 attrib5
[^>]*>               ="prop5"/>
<[^>]*>              </Sibling>
<[^>]*>              <Sibling2 attrib6="prop6">
\([^<]*\)            EXTRACT2
<.*$                 </Sibling2></Element>
  • The first part, ^.*, matches from the start of the line through any text after it, up to the point where the rest of the expression has to begin matching.
  • The attrib5=" matches exactly that string.
  • The \([^"]*\) is where it gets entertaining.
    • First of all, the \( and the \) at the end are not there to escape anything for the command line. In sed's default (basic) regular expression syntax, backslashed parentheses mark a capture group whose contents can be reused in the replacement as \1, \2 and so on. Unescaped parentheses would simply match literal parentheses.
    • The brackets in the [^"]* define a set of characters, and the ^ right after the opening bracket negates that set, so [^"] means any character except a double quote. This is important because sed (as well as grep) is greedy in its matching: a repeated item such as .* grabs as much text as it can, so it would not stop at the first double quote but run on to the last one on the line. Since I want the match to stop when reaching a double quote (non-inclusive), I'm telling it to match everything except a double quote, which makes it stop at the first double quote it encounters. It actually took me quite a while to figure this out, but as you will see, this pattern keeps repeating in the regular expression.
  • The ".*attrib5 matches the closing quotation mark and anything after it until reaching the other occurrence of attrib5. This is important, again because sed is greedy: if I didn't include it, the leading ^.* would be free to run all the way to this second attrib5 rather than stopping before the first one.
  • The [^>]*> matches everything up to and including the next >, which here is everything after attrib5 until the first >.
  • The next two <[^>]*> each match one complete XML tag, from its opening < to its closing >, including all the text inside the tag.
  • The \([^<]*\)< captures the text up until the next < and then matches that <.
  • The .*$ then matches everything up until the end of the line.
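
The pieces described above can be checked together without touching a real file. This is a minimal sketch that pipes the sample line from this post into the same expression, typed with plain ASCII quotes and a plain hyphen; the variable names line and result are only for illustration.

```shell
# Sample line from the article, held in a variable instead of a file.
line='<Element attrib1="prop1" attrib2="prop2" attrib3="prop3" attrib4="prop4"><Child><Child2 attrib5="EXTRACT1"/></Child><Sibling><Child3 attrib5="prop5"/></Sibling><Sibling2 attrib6="prop6">EXTRACT2</Sibling2></Element>'

# \1 is the text captured by the first \( \) group, \2 the second.
result=$(printf '%s\n' "$line" |
  sed -e 's#^.*attrib5="\([^"]*\)".*attrib5[^>]*><[^>]*><[^>]*>\([^<]*\)<.*$#\1 - \2#g')

printf '%s\n' "$result"   # prints: EXTRACT1 - EXTRACT2
```

Because the expression is anchored with ^ and $ and consumes the whole line, the replacement leaves only the two captured values.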
Here are some basic regular expression special symbols:

The ^, or the caret as it is called, has two meanings. As the first character inside square brackets it is a negation: it simply means do not match any of the characters that follow. If it is used at the beginning of the regular expression, it instead means to match the beginning of the string.
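
Here is a small sketch of the caret in both positions, at the start of an expression and as the first character inside square brackets. The input string abcabc and the variable names are made up for illustration.

```shell
# ^ at the start of the expression anchors the match to the start of the line,
# so only the first abc is replaced.
anchored=$(printf 'abcabc\n' | sed -e 's/^abc/X/')

# ^ as the first character inside [ ] negates the set:
# [^b] matches any single character except b.
negated=$(printf 'abcabc\n' | sed -e 's/[^b]/X/g')

printf '%s\n' "$anchored"   # prints: Xabc
printf '%s\n' "$negated"    # prints: XbXXbX
```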

The *, also known as the star or asterisk (as it should be called), is used to match zero or more repetitions of the previous item. The previous item can be a single character, a group wrapped in \( and \), or an expression surrounded by opening and closing brackets like [ and ].
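
The following sketch shows the star at work, including the greedy behaviour that the sed command in this post had to work around. The sample strings and variable names are invented for the example.

```shell
# * also allows zero repetitions: "ac" has no b, yet ab*c still matches it.
zero=$(printf 'ac abc abbbc\n' | sed -e 's/ab*c/X/g')

# .* is greedy: the capture runs to the last double quote on the line.
greedy=$(printf 'a="one" b="two"\n' | sed -e 's/"\(.*\)"/[\1]/')

# [^"]* cannot cross a double quote, so the capture stops at the first closing quote.
bounded=$(printf 'a="one" b="two"\n' | sed -e 's/"\([^"]*\)"/[\1]/')

printf '%s\n' "$zero"      # prints: X X X
printf '%s\n' "$greedy"    # prints: a=[one" b="two]
printf '%s\n' "$bounded"   # prints: a=[one] b="two"
```

The only difference between the last two commands is the item in front of the star, which is why the negated bracket shows up so often in the extraction command.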

Full Arabic Support (RTL) on AuraxTSense 8.4 on HTC Desire w/ Data2SD

Edit: 3/4/2011 There is an improved Arabic support file, and the link has been updated. Also, after flashing it and rebooting, the screen may be black for quite some time. Just wait for it, and it should start working again.

Only recently have I gotten the guts to root my HTC Desire and play around with it. It was working fine before, but what really pushed me to root it was the internal memory issue. What is this internal memory issue, you may ask? Well, the lovely (insert sarcasm here) people at HTC decided to make the internal flash 512 MB in size, and it filled up very quickly!

With about 15 MB left I started getting error messages that my disk space was low. I would end up uninstalling applications and removing e-mail accounts and such to free up memory, because as soon as the phone starts giving the low disk space error message it doesn't sync anymore!

Something had to be done, and the first thing to do was to root the phone!

I researched and found the perfect hassle-free method to root the phone. One word: Unrevoked. I went to the website, chose my phone, and started the rooting process. Of course, I backed up my device and sdcard first, multiple times, with different backup software too. I gave it a go and it worked flawlessly.

Now that the phone was rooted, clockworkmod recovery was installed. You can boot into clockworkmod and do multiple things such as backup and restore, install a ROM or another zip file to your system, and a bunch of other things as well. The backup option with clockworkmod is something you should always do before you install any sort of flash or ROM or zip. Trust me, there have been many times when I found it more than useful and it helped me recover my data very nicely.

The next thing I did was look for a ROM to install and try to fix this low storage issue. After a lot of online research I found AuraxTSense, so I followed the link, downloaded the zip file, and followed the steps. The first time I forgot to wipe everything, so it hung on me. I did it again, this time with a wipe and factory reset, and it worked great. It originally just comes with APP2SD, which helped with the storage issue a little, but soon enough the problem came back! I was getting annoyed, very annoyed! I decided I needed a better solution.

That's when I discovered DATA2SD. The first posts I found talked about how it was still in beta and not very stable, but after a lot of research I found that Starburst ROM was releasing a very stable version of it, so I gave it a shot. Long story short, it worked great, but I had an issue with restoring my data, so I went back to AuraxTSense, determined to get DATA2SD to work on it! After countless searches I found the DATA2SD thread on the XDA developer forums, and after I followed its instructions that worked perfectly too!

I was happy. I had a lot of internal storage now and I was not having any major issues. However, one issue had always been there, and since I had come this far, why not continue and find a solution for the Arabic font not being connected and not being RTL? I searched quite a bit and tried some of the solutions out there, and some of them didn't really work, until I finally found a solution on an Arabic forum. You have to register to download the file, but here is the link for the Arabic support for AuraxTSense 8.4.

The whole process of getting all this to work took me a few weeks, but I'm finally done and everything seems to be working really well. I don't develop any of these applications so I can't really answer complicated questions, but if you happen to have any about how I did something or want to know more about my experience, then ask away.

If you are here just to know how to get this all done, click each of the four links above and follow the instructions.