╦═╗╔═╗╔═╗╔═╗╦═╗╦ ╦╔═╗╦╦═╗ ╠╦╝║╣ ╚═╗║╣ ╠╦╝╚╗╔╝║ ║║╠╦╝ ╩╚═╚═╝╚═╝╚═╝╩╚═ ╚╝ ╚═╝╩╩╚═ ╔═╗╔═╗╔╦╗╔═╗╦ ╦╔╗╔╔═╗ ╚═╗╠═╣║║║╠═╝║ ║║║║║ ╦ ╚═╝╩ ╩╩ ╩╩ ╩═╝╩╝╚╝╚═╝ ============================================================================ Pick a random line from a file. Easy, right? my @lines = ; print $lines[rand @lines]; Load the whole file, pick one. Works fine for small files. Now do it on a 50GB log file. You can't load 50GB into memory. You don't know how many lines there are until you've read them all. You can only read forward, once. One line solves this: perl -e 'rand($.) < 1 && ($line = $_) while <>; print $line' Every line has an equal chance of being selected. Mathematically proven. Memory usage: one line, always. ============================================================================ PART 1: THE ALGORITHM --------------------- This is called reservoir sampling. Invented by computer scientists, compressed into Perl by hackers. The logic: Line 1: 1/1 chance (100%) - always keep it Line 2: 1/2 chance (50%) - maybe replace Line 3: 1/3 chance (33%) - maybe replace Line N: 1/N chance - maybe replace When you're done, every line had exactly 1/N chance of being the final selection. The math works out perfectly. ============================================================================ PART 2: HOW IT WORKS -------------------- rand($.) < 1 && ($line = $_) while <> Break it down: PIECE WHAT IT DOES ------------ ------------------------------------------ while <> Read lines one at a time $. Current line number (1, 2, 3, ...) rand($.) Random number from 0 to $. (exclusive) rand($.) < 1 True with probability 1/$. && Short-circuit: only do right side if left is true ($line = $_) Save current line (also returns true) The parentheses around the assignment matter. Without them, the precedence is wrong. .--. |o_o | |:_/ | // \ \ (| | ) /'\_ _/`\ \___)=(___/ ============================================================================ PART 3: THE PROBABILITY PROOF ----------------------------- Don't trust me? Let's prove it. For line K to be the final selection: 1. It must be selected when read: probability 1/K 2. It must NOT be replaced by any later line Probability of not being replaced by line K+1: (K)/(K+1) Probability of not being replaced by line K+2: (K+1)/(K+2) ... Probability of not being replaced by line N: (N-1)/N Multiply them all: (1/K) × (K/(K+1)) × ((K+1)/(K+2)) × ... × ((N-1)/N) Everything cancels (telescoping product): = 1/N Every line has exactly 1/N probability. QED. ============================================================================ PART 4: VARIATIONS ------------------ Select K random lines (reservoir of size K): perl -e ' @r = (); while (<>) { if (@r < 5) { push @r, $_ } elsif (rand($.) < 5) { $r[rand @r] = $_ } } print @r; ' hugefile.txt Keep 5 random lines. Same principle, bigger reservoir. Weighted random selection (by line length): perl -e ' $w += length; rand($w) < length && ($line = $_) while <>; print $line ' Longer lines are more likely to be selected. ============================================================================ PART 5: PRACTICAL USES ---------------------- Sample from logs without loading them: perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l' /var/log/huge.log Random test case from a data file: perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l' test_cases.txt Quick sanity check on data: for i in {1..10}; do perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l' data.csv done Ten random samples. See if your data looks reasonable. ============================================================================ PART 6: MULTIPLE RANDOM LINES ----------------------------- Want more than one? Run it multiple times: for i in {1..5}; do perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l' file.txt done Or use the reservoir approach for guaranteed distinct lines: perl -e ' while (<>) { if ($. <= 10) { $r[$.-1] = $_ } elsif (rand($.) < 10) { $r[rand 10] = $_ } } print @r ' file.txt Ten distinct random lines. ============================================================================ PART 7: SEEDING RANDOMNESS -------------------------- Need reproducible results? Seed the RNG: perl -e 'srand(42); rand($.) < 1 && ($l = $_) while <>; print $l' file.txt Same seed = same "random" selection. Useful for testing. ============================================================================ PART 8: COMPARISON ------------------ Three approaches to random line selection: METHOD MEMORY SPEED WORKS ON STREAMS ---------------- -------- -------- ---------------- Load into array O(N) O(N) No Two-pass (count) O(1) O(2N) No Reservoir O(1) O(N) Yes Reservoir wins on memory and works on pipes: some_command | perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l' You can't seek on a pipe. Reservoir sampling doesn't need to. ============================================================================ PART 9: GOTCHAS --------------- Empty files: perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l' /dev/null Prints nothing. $line is never set. Add a check if you care: perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l // "empty\n"' Very short files: echo "only line" | perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l' Works fine. Line 1 has 100% chance of selection. ============================================================================ PART 10: THE ONE-LINER ---------------------- Memorize this: perl -e 'rand($.) < 1 && ($line = $_) while <>; print $line' Or shorter: perl -ne 'rand($.) < 1 && ($l = $_) }{ print $l' Butterfly version. Eight characters shorter. Or with -l for clean output: perl -lne 'rand($.) < 1 && ($l = $_) }{ print $l' ============================================================================ PART 11: WHY THIS MATTERS ------------------------- This isn't just a trick. It's a real algorithm with real applications: - Sampling from data streams - A/B testing selection - Random auditing - Load testing with random inputs - Statistical sampling from logs When you can't fit the data in memory and you can't make two passes, reservoir sampling is the answer. And Perl does it in one line. ============================================================================ rand($.) < 1 | v +---------+ | keep? | +---------+ / \ yes no / \ [save] [next] Fair selection from infinite streams ============================================================================ japh.codes