Yesterday I walked through mapping a Synology share and pointing qgrep at breach data so the index lives next to the dataset instead of cluttering my home directory. That’s great for portability, but there’s one more step worth calling out: making sure the index actually covers all the file types you care about.
Out of the box, qgrep is tuned for source code search. Its default .cfg file includes a ton of language extensions (.cpp, .java, .cs, etc.), but in the breach-analysis world we don’t just see clean .json and .csv files. We also see:
- .sql dumps
- .tsv, .bak, .log
- files with no extension at all (shadow, passwd, dump)
- random one-off names that would never match the defaults
If you don’t tweak the config, qgrep never indexes those files, and your searches will miss a large share of what matters.
My Config
Here’s the version I landed on for my china.cfg project, which points at Z:/breach_data/china:
path Z:/breach_data/china
# index all files
include .*
# exclude obvious binaries
exclude \.(exe|dll|so|o|a|class|jar|pdb|jpg|jpeg|png|gif|bmp|ico|tif|tiff|mp3|mp4|avi|mov|wmv|zip|rar|7z|gz|tar|xz)$
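Before rebuilding, it can help to sanity-check the exclude pattern against a few file names. The sketch below is only an approximation: it uses Python's re module as a stand-in for qgrep's regex engine (which may differ in details), and the sample names and the is_indexed helper are made up for illustration.

```python
import re

# Same pattern as the exclude line in the config above.
EXCLUDE = re.compile(
    r"\.(exe|dll|so|o|a|class|jar|pdb|jpg|jpeg|png|gif|bmp|ico|tif|tiff"
    r"|mp3|mp4|avi|mov|wmv|zip|rar|7z|gz|tar|xz)$"
)


def is_indexed(path: str) -> bool:
    """Approximate the config: include everything, then drop exclude matches."""
    return EXCLUDE.search(path) is None


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    samples = ["users.sql", "shadow", "accounts.tsv", "site.bak",
               "clip.mp4", "archive.tar.gz"]
    for name in samples:
        status = "indexed" if is_indexed(name) else "skipped"
        print(f"{name:16} {status}")
```

Extensionless names like shadow pass straight through because the pattern only fires on a listed extension at the very end of the path.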
Why This Works
- include .* → blunt hammer. It matches literally everything, regardless of extension. No more worrying about weird .dump files slipping through.
- Exclude list → keeps your index lean. No need to burn cycles indexing images, video, archives, or executables. Those add zero value in a text-search workflow and can blow up index size fast.
This strikes a nice balance: you grab all the text-like content you care about without indexing 12GB of .mp4 memes someone dropped into the dump.
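If you're not sure which extensions belong on the exclude list for a given dump, a quick survey of the share can tell you. This is a minimal sketch, not part of qgrep: extension_counts is a hypothetical helper, and the path is simply the one from the config above.

```python
import os
from collections import Counter


def extension_counts(root: str) -> Counter:
    """Count files under root by lowercase extension ('' means no extension)."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            counts[ext] += 1
    return counts


if __name__ == "__main__":
    # Path taken from the config above; adjust for your own share.
    for ext, n in extension_counts("Z:/breach_data/china").most_common(20):
        print(f"{ext or '(none)':10} {n}")
```

Anything near the top of that list that is clearly binary is a candidate for the exclude line. Anything text-like is already covered by include .*.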
Updating the Index
After editing the config, just re-run:
qgrep update "Z:\qgrep_china_index\china.cfg"
That will rebuild using the new include/exclude filters. Once that finishes, you can go right back to searching:
qgrep search "Z:\qgrep_china_index\china.cfg" i "password"
qgrep search "Z:\qgrep_china_index\china.cfg" il "example@domain.com"
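If you run the update-then-search pair often, you can wrap it in a small script. This is a rough sketch that assumes qgrep is on your PATH. It only uses the two commands shown above, and the qgrep_args and refresh_and_search helpers are my own names, not part of qgrep.

```python
import subprocess

# Project file from the commands above.
CFG = r"Z:\qgrep_china_index\china.cfg"


def qgrep_args(action: str, *rest: str, cfg: str = CFG) -> list[str]:
    """Build an argument list such as: qgrep search <cfg> i password."""
    return ["qgrep", action, cfg, *rest]


def refresh_and_search(term: str, options: str = "i") -> str:
    """Rebuild the index, then return the raw search output for term."""
    subprocess.run(qgrep_args("update"), check=True)
    result = subprocess.run(
        qgrep_args("search", options, term),
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(refresh_and_search("password"))
```

Passing the arguments as a list sidesteps shell quoting issues with the backslashes in the Windows path.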
Wrap-Up
The takeaway: don’t settle for the defaults. qgrep is lightning-fast, but only if you feed it the right filters. By flipping your config to include everything and then exclude just the obvious junk, you guarantee coverage across messy, real-world breach data without ballooning the index with useless files.