this post was submitted on 24 Dec 2023
142 points (94.9% liked)

submitted 8 months ago* (last edited 8 months ago) by [email protected] to c/[email protected]
 

I've been reading Mastering Regular Expressions by Jeffrey E.F. Friedl, and since nobody in my life (aside from my wife) cares, I thought I’d share something I'm pretty proud of: my first set of regular expressions, which I wrote myself to manipulate the text I'm working with.

What I’m so happy about is that I wrote these expressions. I understand exactly what they do and the purpose of each character in each expression.

I've used regex in the past, mostly stuff cobbled together from Stack Overflow, but I never really understood how the expressions worked or what they meant, just that they did what I needed them to do at the time.

I'm only about 10% of the way through the book, but already I understand so much more than I ever did about regex (I also recognize I have a lot to learn).

I wrote the expressions to be used with egrep and sed to generate and clean up a list of filenames pulled out of tarballs (movies I've ripped from my DVD collection and tarballed to archive them).

The first expression I wrote was this one used with tar and egrep to list the files in the tarball and get just the name of the video file:

tar -tzvf file.tar.gz | egrep -o '\/[^/]*\.m(kv|p4)' > movielist

Which gives me a list of movies of which this is an example:

/The.Hunger.Games.(2012).[tmdbid-70160].mp4
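
For anyone following along without a tarball handy, here is a small sketch of the same extraction step run against a made-up listing line. The sample line and the function name `extract_name` are assumptions for illustration, and `grep -E` stands in for `egrep`:

```shell
#!/bin/sh
# Hypothetical helper: pull "/name.mkv" or "/name.mp4" out of each listing line.
#   /            a literal slash (no escaping needed in grep)
#   [^/]*        any run of characters that are not slashes (the file name)
#   \.m(kv|p4)   a literal dot followed by "mkv" or "mp4"
extract_name() {
    grep -E -o '/[^/]*\.m(kv|p4)'
}

# A made-up line in the style of `tar -tzvf` output (an assumption, not real data).
sample='-rw-r--r-- user/user 123 2023-12-24 10:00 movies/The.Hunger.Games.(2012).[tmdbid-70160].mp4'
printf '%s\n' "$sample" | extract_name
```

The slash in `user/user` does not produce a match because no `.mkv` or `.mp4` follows it before the next slash, so only the final path component is printed.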

Then I used sed with the expression groups to remove:

  • the leading forward slash
  • Everything from .[ to the end
  • All of the periods in between words

And the last expression checks for one or more spaces and replaces them with a single space.

This is the full sed command:

sed -E -i -e 's/^\///; s/\.\[[a-z]+-[0-9]+\]\.m(p4|kv)//; s/[^a-zA-Z0-9\(\)&-]/ /g; s/ +/ /g' movielist

Which leaves me with a pretty list of movies that looks like this:

The Hunger Games (2012)
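
To see what each substitution contributes, the four expressions can be split into separate `-e` scripts and fed a sample name on stdin instead of editing a file in place. This is only a sketch: the function name `clean_title` and the second sample title are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: apply the four substitutions from the post to stdin.
#   1. s/^\///                             drop the leading slash
#   2. s/\.\[[a-z]+-[0-9]+\]\.m(p4|kv)//   drop ".[tag-123].mp4" or ".[tag-123].mkv"
#   3. s/[^a-zA-Z0-9\(\)&-]/ /g            turn every other odd character (the dots) into a space
#   4. s/ +/ /g                            collapse runs of spaces into one
clean_title() {
    sed -E \
        -e 's/^\///' \
        -e 's/\.\[[a-z]+-[0-9]+\]\.m(p4|kv)//' \
        -e 's/[^a-zA-Z0-9\(\)&-]/ /g' \
        -e 's/ +/ /g'
}

printf '%s\n' '/The.Hunger.Games.(2012).[tmdbid-70160].mp4' | clean_title
printf '%s\n' '/Alien.(1979).[tmdbid-348].mkv' | clean_title
```

Removing the `-e` lines one at a time from the bottom up shows the intermediate text after each step.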

I'm sure this could be done more elegantly, and I'm happy for any feedback on how to do that! For now, I'm just excited that I'm beginning to understand regex and how to use it!

Edit: fixed title so it didn’t say “regex expressions”

top 46 comments
[–] [email protected] 82 points 8 months ago (1 children)

Knowledge and understanding. Feels good, man.

Obligatory xkcd.

[–] [email protected] 21 points 8 months ago (1 children)

Ah....the days when perl was the shit and python was still a glimmer in the eye of some frustrated programmer.

[–] [email protected] 62 points 8 months ago (1 children)

I relearn regex from scratch every time I need to use it.

[–] [email protected] 3 points 8 months ago

This is the way.

[–] [email protected] 25 points 8 months ago* (last edited 8 months ago) (4 children)

Good job!

I highly recommend trying out the various online regex editors.

These WYSIWYG kinds of editors are great because you immediately see what the regex is catching and for what reason.

I took the first one in my search results but try different ones.

https://regex101.com/

I've also used GPT to get regexes for some specific strings, and it can be a helpful quick start when building a specific regex.

In that case I was building a regex for a specific log from postfix.

PS: just make sure to select the correct flavor of regex you are using in these online tools.
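
A quick way to see why the flavor matters, sketched with plain `grep` (basic regular expressions) against `grep -E` (extended regular expressions). The function names and the three sample lines are made up for illustration:

```shell
#!/bin/sh
# Count how many input lines match the same pattern under each flavor.
count_bre() { grep -c -- "$1"; }      # basic regular expressions (plain grep)
count_ere() { grep -E -c -- "$1"; }   # extended regular expressions (grep -E)

input='movie.mkv
movie.mp4
movie.m(kv|p4)'

# In BRE the characters ( | ) are literal, so only the third line matches.
printf '%s\n' "$input" | count_bre 'm(kv|p4)'
# In ERE they mean grouping and alternation, so the first two lines match.
printf '%s\n' "$input" | count_ere 'm(kv|p4)'
```

The same pattern text gives different answers, which is the kind of mismatch you get when the online tool is set to a different flavor than the command you end up running.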

Edit: Also one of my favorite YT channels has pretty cool videos on RegEx : https://youtu.be/6gddK-cOxYc?si=0bnNkSDzifjdxwjU

[–] [email protected] 6 points 8 months ago

Regex101 is amazing. It tends to balk at backtracking, which we rely on a lot for work, but it's such a good visual.

ChatGPT can also save a lot of time writing regex, but it tends to write very unreadable regex because it thinks it's being clever when it really isn't.

Regex is an art form, and writing readable regex is another step above that.

[–] [email protected] 2 points 8 months ago

Piggybacking onto this to mention my go-to online RegEx editor: RegExr. It lets you test the regex as you type, explains the particular symbols used, as well as has a sidebar where you can see different pattern types categorically. I've been using it for almost 2 years now, and haven't had any reason to use much else (after I discovered this).

[–] [email protected] 2 points 8 months ago

Computerphile! I’ll check those out.

[–] [email protected] 1 points 8 months ago* (last edited 8 months ago)

Thank you very much. I will definitely check out the regex builders. That’ll be super useful

Edit: fix stupid autocorrect turning regex into Reyes.

[–] [email protected] 14 points 8 months ago (1 children)

Just to chip in because I haven't seen it mentioned yet: I find LLMs like ChatGPT or Microsoft Copilot are really good at making regexes and also at explaining them. So if you're learning them, or just want to get the darned thing to work so you can go to bed, those are a good resource.

[–] [email protected] 4 points 8 months ago (1 children)

You know, I haven’t yet used ChatGPT for anything, I might check it out for this reason.

[–] [email protected] 4 points 8 months ago

I use it to tell me which page of the Pathfinder 1e manual I should look on for the rules I need.

[–] [email protected] 14 points 8 months ago* (last edited 8 months ago) (1 children)

It is a great book, although a bit outdated. In particular, nowadays using egrep is not recommended. grep -E is a more portable synonym.

Some notes on your script:

  1. You don't need to escape slashes in a grep regex. In the sed s/// command, it's better to use another delimiter character, like s###, so you can leave slashes unescaped there too.

  2. You usually don't need to pipe grep into sed: sed -n with a regex address and an explicit print command gives the same result as grep.

  3. You could omit the leading slash in your egrep regex, so you won't need to remove it later.

So I would do the same with

tar -tzvf file.tar.gz | sed -En '/\.(mp4|mkv)$/{s#^.*/##; s#\.\[.*##; s#[^a-zA-Z0-9()&-]# #g; s/ +/ /g; p}'
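
As a sketch of how that one-liner behaves, here it is wrapped in a function and run on a few made-up listing lines instead of a real tarball. The function name `list_titles` and the sample lines are assumptions, and the `p}` ending relies on GNU sed accepting a closing brace without a preceding semicolon:

```shell
#!/bin/sh
# Hypothetical wrapper around the sed script above.
#   /\.(mp4|mkv)$/   address: only act on lines ending in .mp4 or .mkv
#   s#^.*/##         drop everything up to and including the last slash
#   s#\.\[.*##       drop everything from ".[" to the end
#   the last two substitutions tidy separators, then p prints the result
list_titles() {
    sed -En '/\.(mp4|mkv)$/{s#^.*/##; s#\.\[.*##; s#[^a-zA-Z0-9()&-]# #g; s/ +/ /g; p}'
}

# Made-up lines in the style of `tar -tzvf` output.
printf '%s\n' \
    'drwxr-xr-x user/user 0 2023-12-24 10:00 movies/' \
    '-rw-r--r-- user/user 123 2023-12-24 10:00 movies/The.Hunger.Games.(2012).[tmdbid-70160].mp4' \
    '-rw-r--r-- user/user 45 2023-12-24 10:00 movies/cover.jpg' \
    | list_titles
```

Because of `-n`, lines that do not match the address (the directory and the image here) are never printed, which is what replaces the separate grep step.
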
[–] [email protected] 4 points 8 months ago (1 children)

nowadays using egrep is not recommended. grep -E is a more portable synonym

Not directed at you personally, but this is the kind of pointless pedantry from upstream developers that grinds my gears.

Like, I've used egrep for 25 years. I don't know of a still relevant Unix variant in existence that doesn't have the egrep command. But suddenly now, when any other Unix variant but Linux is all but extinct, and all your shell scripts are probably full of bashisms and Linuxisms anyway, now there is somehow a portability problem, and they deem it necessary to print out a warning whenever I dare to run egrep instead of grep -E? C'mon now ... If anything, they have just made it less portable by spitting out spurious warnings where there weren't any before.

[–] [email protected] 2 points 8 months ago (1 children)

GNU grep, the most widespread implementation, has not included egrep, fgrep and rgrep for years. Distributions (not all, but many) provide shell scripts that simply run grep with the corresponding option for backward compatibility. You can learn this from the official documentation.

Also, my scripts are not full of bashisms, gnuisms, linuxisms and other -isms, I try to keep them portable unless it is really necessary to use some unportable command or syntax.

[–] [email protected] -1 points 8 months ago* (last edited 8 months ago) (2 children)

GNU grep, the most widespread implementation, has not included egrep, fgrep and rgrep for years. Distributions (not all, but many) provide shell scripts that simply run grep with the corresponding option for backward compatibility. You can learn this from the official documentation.

It seems you need to read the official documentation yourself. While it's new information to me that egrep is no longer a symlink, as it used to be a couple of years ago, but a shell script wrapper to grep -E instead, the egrep command is to this day still provided by upstream GNU grep and is installed by default if you run ./configure; make; make install from source. So it is not a backward compatibility hack provided by the distribution.

You can check for yourself. Download the source from https://ftp.gnu.org/gnu/grep/grep-3.11.tar.gz, unpack and look for src/egrep.sh or line 1756 of src/Makefile. Apparently the change from symlink to shell script was done in 2014, and the deprecation warning was added only last year.
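
To make "a shell script wrapper to grep -E" concrete, here is a minimal sketch of the idea as a shell function, so it can be tried without touching the system. The name `egrep_compat` is made up, and this is not the text of GNU's `src/egrep.sh`, only the general shape of forwarding every argument to `grep -E`:

```shell
#!/bin/sh
# Hypothetical stand-in for an egrep wrapper: pass all arguments through to grep -E.
egrep_compat() {
    grep -E "$@"
}

# Count the lines that end in .mkv or .mp4 (sample names are made up).
printf '%s\n' 'a.mkv' 'b.mp4' 'c.txt' | egrep_compat -c '\.m(kv|p4)$'
```

A script that calls `grep -E` directly behaves the same way and does not depend on whether an `egrep` command is installed.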

In any case, my larger point is that the deprecation of egrep was a pointless and arbitrary decision that does not benefit users, especially not veterans like myself who have become accustomed to its presence. I don't mind change, but let's be honest, most people are not in the habit of checking the minutiae of every little command line utility they use, so a change like this violates the principle of least surprise. It's one thing if things are changed for a good reason and the users not only suffer the inconvenience of the change but get to reap the benefits of it as well, but so far I haven't found any justification for it, nor can I think of any.

So if there is a portability problem with using egrep now, it's a self-inflicted portability problem that they caused by deprecating egrep in the first place.

Also, my scripts are not full of bashisms, gnuisms, linuxisms and other -isms, I try to keep them portable unless it is really necessary to use some unportable command or syntax.

Good for you. Do you want a cookie or something?

[–] [email protected] 2 points 8 months ago (1 children)

Good for you. Do you want a cookie or something?

I don't know about that guy but you need a chill-pill dude.

[–] [email protected] 1 points 8 months ago* (last edited 8 months ago)

Well he wrote it like he wanted to be applauded for it or something.

I also find the irony of your comment extremely funny ... although that's probably lost on you.

Later, dude.

[–] [email protected] 0 points 8 months ago (1 children)

It seems you need to read the official documentation yourself.

I did. Debian man page, GNU grep manual.

I'm sorry for your loss, however the egrep deprecation is a fact. Of course you can continue using it as a veteran, but it is not correct to recommend this to beginners.

[–] [email protected] 1 points 8 months ago* (last edited 8 months ago)

You are strawmanning, and your links are not countering any point I made. I never disputed the deprecation as fact, and I never recommended that beginners should use egrep over grep -E.

I disputed your claims that the egrep command has just been a distro hack all these years, when in fact GNU to this day still distributes egrep through its source tarballs and only very recently started to warn about it through the wrapper script. And again, the only "portability problem" here is the fact that they deprecated it in the first place, i.e. a self-inflicted one.

Then as a Linux and Unix veteran I gave my subjective opinion by lamenting and criticizing the fact that this deprecation happened, and how changes like this always feel like unnecessary pedantry to me. Yes it's an expression of frustration, but I am allowed to feel frustrated about it. I don't need people like you invalidating how I feel about breaking changes in software that I use daily.

[–] [email protected] 11 points 8 months ago (1 children)

Just adding my congrats. Good job, OP. Regex is super useful stuff.

[–] [email protected] 2 points 8 months ago

Thank you!!!

[–] [email protected] 8 points 8 months ago (1 children)

"regex" means "regular expression", so "regex expression" means "regular expression expression".

[–] [email protected] 4 points 8 months ago

Dang! I read through my post three times to make sure I didn’t do that and completely missed that I did it right in the title. (Now fixed).

[–] [email protected] 7 points 8 months ago (2 children)

I’ll have to check out this book. Just remember HTML cannot be parsed with regex.

[–] [email protected] 3 points 8 months ago* (last edited 8 months ago)

Well, technically it is possible with a regex dialect that has lookarounds, but it is overcomplicated. There's really no reason to do it.

[–] [email protected] 2 points 8 months ago

Thanks for that link.

[–] [email protected] 5 points 8 months ago (2 children)

I think the most impressive part of this is that your wife cares.

...does she have a sister?

[–] [email protected] 3 points 8 months ago (1 children)

I'm currently seeing a girl I started dating after she had problems with her regex and I helped her out.

So far so good.

[–] [email protected] 1 points 8 months ago

@sab @prowess2956 @harsh3466 now you have two problems, but you don't know it yet

[–] [email protected] 1 points 8 months ago

She does, but I’d stay away from the sister. 🤣

[–] [email protected] 3 points 8 months ago* (last edited 8 months ago)

I can also recommend the book the TS mentioned, it is very good and after reading it you will understand regular expressions. It's fine to use a cheat sheet if you want, cause if you don't do it regularly the knowledge can sag, but the understanding is what matters. Also depending on the context, different implementations can have slightly different syntax or modifiers to be aware of.

I lent out the book to my brother once and he somehow lost it, so I never got it back. Don't lend out books, guys.

And remember not everything can be solved using a regular expression: https://xkcd.com/1171/

[–] [email protected] 3 points 8 months ago

I stumbled upon this regex crossword puzzle a while back. I was never good enough to get it, but it seems like it could be fun.

[–] [email protected] 3 points 8 months ago (1 children)
[–] [email protected] 1 points 8 months ago (1 children)

That looks like a great way to practice

[–] [email protected] 2 points 8 months ago (1 children)

It’s definitely a way to get your regex-fu to the next level, especially if you have people to compete against.

[–] [email protected] 2 points 8 months ago (1 children)

Oh gosh. There are regex competitions out there, aren’t there.

[–] [email protected] 1 points 8 months ago

Yup, including for the largest “in production” regular expression….

[–] [email protected] 3 points 8 months ago

Congrats on your learning! I did a similar thing with music and converting all random songs to mp3

[–] [email protected] 2 points 8 months ago

I was wondering a few years ago how far you could get with implementing some simple markup syntax with just regex. Turns out, surprisingly far, but once stuff starts going wrong you're in a less than ideal situation.

https://github.com/bwachter/awfulcms/blob/master/lib/AwfulCMS/SynBasic.pm

[–] [email protected] 2 points 8 months ago (1 children)

Regexps are awesome! And also not at the same time:-P. 🎉 Congrats👏!:-)

[–] [email protected] 1 points 8 months ago
[–] [email protected] 1 points 8 months ago (1 children)

That's really cool! I know some regex and I tried to learn vim regex, only to find out it's a rabbit hole so deep I'm afraid to look into it. The feeling when you press enter and your carefully crafted regex does exactly what it's supposed to do is awesome though. Good luck!

[–] [email protected] 2 points 8 months ago (1 children)

Vim is on my list of things to learn. I didn’t even know vim had its own regex, but I suppose that makes sense. I’ve messed with vim a bit, but have stuck to nano so far.

[–] [email protected] 1 points 8 months ago

You might also like micro