BASH: filter a huge list of number pairs, keeping those contained in another huge list
Let's say I have two CSV files. The first one has the format:
id(unique int),owner_id(non-unique int),string
It contains 50-100 million rows and is a few GB in size.
The second one has the format:
integer,integer
The second file contains around a billion rows. I want to get all the
rows of file 2 where both the first and the second column values exist
somewhere in the second column (owner_id) of file 1.
The most efficient way I can think of would be to load the unique
owner_id values into memory, sort them, and do a binary search for each
value of every pair from the second file. I don't know if something like
this can be done with BASH. I could do it with Python (a simple script
that takes the two files, loads the owner_ids, and writes out the rows of
the second file that are valid pairs).
However, I'd like to avoid adding a dependency on Python, if possible.
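For reference, here is a minimal sketch of how the lookup described above might be done without Python, using awk from a BASH function. It swaps the sorted list plus binary search for awk's built-in hash array, which also gives fast membership tests. The function name filter_pairs and its argument names are made up for illustration, and the sketch assumes that both files are plain comma-separated with no header row, that file 1 is not empty, and that the set of unique owner_id values fits in memory.

```shell
#!/usr/bin/env bash

# filter_pairs OWNERS_FILE PAIRS_FILE   (hypothetical helper name)
# Prints every row of PAIRS_FILE whose first AND second column both
# occur as an owner_id (column 2) somewhere in OWNERS_FILE.
filter_pairs() {
    local owners_file=$1 pairs_file=$2
    awk -F, '
        # First file: NR == FNR holds only while reading it.
        # Record each owner_id as a key; duplicates collapse automatically.
        NR == FNR { owners[$2]; next }

        # Second file: print the row only if both columns are known owner_ids.
        ($1 in owners) && ($2 in owners)
    ' "$owners_file" "$pairs_file"
}
```

It could be called as `filter_pairs file1.csv file2.csv > valid_pairs.csv`, where the file names are placeholders. Only the unique owner_id values are held in memory; the billion-row file is streamed once, line by line. Commas inside the third (string) column of file 1 do not matter here, since only its second field is read.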