╔╦╗╦ ╦╔═╗╦  ╦╔═╗╔═╗╔╦╗╔═╗
 ║║║ ║╠═╝║  ║║  ╠═╣ ║ ║╣
═╩╝╚═╝╩  ╩═╝╩╚═╝╩ ╩ ╩ ╚═╝
╔╦╗╔═╗╔╦╗╔═╗╔═╗╔╦╗╔═╗╦═╗
 ║║║╣  ║ ║╣ ║   ║ ║ ║╠╦╝
═╩╝╚═╝ ╩ ╚═╝╚═╝ ╩ ╚═╝╩╚═

============================================================================

Find duplicates. One of the most common tasks in sysadmin life.

    perl -ne 'print if $seen{$_}++'

That's it. Prints every line that appears more than once.

The trick is the post-increment. First time a line appears, $seen{$_} is 0
(false), so nothing prints. Second time, it's 1 (true), so it prints. Third
time, still true, prints again.

Want only the second occurrence? We'll get there.

============================================================================
PART 1: HOW IT WORKS
--------------------

    print if $seen{$_}++

Break it down:

    PIECE          WHAT IT DOES
    -------------  ------------------------------------------
    $seen{$_}      Hash lookup - has this line been seen?
    ++             Post-increment - add 1 AFTER returning value
    $seen{$_}++    Returns OLD value (0 first time), then increments
    print if ...   Print if the condition is true (non-zero)

First encounter: $seen{$_} is undef (0 in numeric context). Returns 0, then
becomes 1. Condition false, no print.

Second encounter: $seen{$_} is 1. Returns 1, then becomes 2. Condition true,
print.

        .--.
       |o_o |
       |:_/ |
      //   \ \
     (|     | )
    /'\_   _/`\
    \___)=(___/

============================================================================
PART 2: VARIATIONS
------------------

Print duplicates only once (not every repeat):

    perl -ne 'print if $seen{$_}++ == 1'

The == 1 means "only on the second occurrence."

Print each line only once (first occurrence of each):

    perl -ne 'print unless $seen{$_}++'

Flip the logic. Print first occurrence, skip all repeats.
Print lines that appear exactly N times (count first, report at EOF):

    perl -ne '$c{$_}++; END { print for grep { $c{$_} == 3 } keys %c }'

One pass: tally every line into %c, then the END block prints the lines
whose count is exactly 3 (substitute your N).

============================================================================
PART 3: FIRST VS LAST OCCURRENCE
--------------------------------

First occurrence of each line:

    perl -ne 'print unless $seen{$_}++'

Last occurrence of each line:

    perl -ne '$last{$_} = $_; END { print values %last }'

This overwrites each time, so only the last survives. But order is lost.

Want last occurrence in order?

    perl -ne '$last{$_} = $.; END { print sort { $last{$a} <=> $last{$b} } keys %last }'

Store line numbers, sort by them at the end.

============================================================================
PART 4: CASE INSENSITIVE
------------------------

Ignore case when detecting duplicates:

    perl -ne 'print if $seen{lc $_}++'

The lc lowercases the key. "Hello" and "HELLO" are now the same.

Normalize whitespace too:

    perl -ne '$k = lc; $k =~ s/\s+/ /g; print if $seen{$k}++'

============================================================================
PART 5: FIELD-BASED DUPLICATES
------------------------------

Duplicate detection on a specific column:

    perl -ane 'print if $seen{$F[0]}++'

The -a splits each line into @F. This checks for duplicate first fields only.

Duplicate IPs in a log:

    perl -ane 'print if $seen{$F[0]}++' access.log

Duplicate usernames in /etc/passwd:

    perl -F: -ane 'print if $seen{$F[0]}++' /etc/passwd

============================================================================
PART 6: COUNTING DUPLICATES
---------------------------

How many times does each line appear?

    perl -ne '$c{$_}++; END { print "$c{$_}: $_" for keys %c }'

Output like:

    3: this line appeared three times
    1: this line appeared once
    5: this line appeared five times

Sorted by count:

    perl -ne '$c{$_}++; END { print "$c{$_}: $_" for sort { $c{$b} <=> $c{$a} } keys %c }'

Most frequent first.
============================================================================
PART 7: ADJACENT DUPLICATES
---------------------------

The uniq command only removes adjacent duplicates:

    perl -ne 'print unless $_ eq $last; $last = $_'

Line must equal the previous line to be skipped. This is faster (no hash)
but only catches consecutive repeats:

    aaa
    aaa   <- removed
    bbb
    aaa   <- NOT removed, not adjacent

============================================================================
PART 8: REAL WORLD EXAMPLES
---------------------------

Find duplicate lines in a config file:

    perl -ne 'print "$.: $_" if $seen{$_}++' config.ini

Includes line numbers so you can find them.

Duplicate entries in /etc/hosts:

    perl -ane 'print if $seen{$F[1]}++' /etc/hosts

Checks hostname field (second column).

Duplicate SSH keys:

    perl -ne 'print if $seen{(split)[1]}++' ~/.ssh/authorized_keys

Keys are in the second field. Finds if someone's key is listed twice.

Duplicate cron jobs:

    crontab -l | perl -ne 'print if $seen{$_}++'

============================================================================
PART 9: MEMORY CONSIDERATIONS
-----------------------------

The hash stores every unique line. For huge files with many unique lines,
this eats memory.

For massive files, consider:

    sort file.txt | uniq -d

External sort handles files larger than RAM. But loses original order.

Or process in chunks if you only care about recent duplicates:

    tail -10000 huge.log | perl -ne 'print if $seen{$_}++'

============================================================================
PART 10: THE FAMILY
-------------------

These patterns are related:

    perl -ne 'print if $seen{$_}++'                           # All duplicates
    perl -ne 'print if $seen{$_}++ == 1'                      # Each duplicate once
    perl -ne 'print unless $seen{$_}++'                       # Unique lines only
    perl -ne '$c{$_}++; END{print for grep{$c{$_}>1}keys%c}'  # Dupes, one each

The post-increment idiom is the heart of all of them.
============================================================================
PART 11: WHY POST-INCREMENT
---------------------------

Why $seen{$_}++ instead of ++$seen{$_}?

Pre-increment returns the NEW value:

    ++$seen{$_}    # Returns 1 on first encounter (true!)

Post-increment returns the OLD value:

    $seen{$_}++    # Returns 0 on first encounter (false!)

With pre-increment, everything prints. The ++ happens before the return.
Post-increment is what makes the logic work.

============================================================================
PART 12: COMBINING WITH OTHER PATTERNS
--------------------------------------

Duplicates matching a pattern:

    perl -ne 'print if /error/i && $seen{$_}++'

Only errors, and only repeated ones.

Duplicates across multiple files:

    perl -ne 'print "$ARGV: $_" if $seen{$_}++' *.log

Shows which file the duplicate is in.

Duplicates within a time window (log files):

    perl -ane '
        $t = $F[0];
        %seen = () if $t ne $last_t;
        print if $seen{$_}++;
        $last_t = $t
    ' timestamped.log

Resets the seen hash when the timestamp changes.

============================================================================

              $seen{$_}++
                   |
              +----+----+
              |         |
            first     again
              |         |
            skip      print

      The post-increment trick

============================================================================

japh.codes