Keep ports as your predicate.
When duplicating data, start with the item that has the smallest footprint, since its total impact on storage is usually insignificant. Whenever possible, I encode duplicates as ENUMs, because that usually translates to just one additional integer value per row.
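As a minimal sketch of that idea (the `Protocol` column is a hypothetical example, not from your schema): a string value that repeats across millions of rows collapses to one small integer per row.

```python
from enum import IntEnum

# Hypothetical example: a "protocol" column whose handful of string
# values repeat across millions of rows. Encoding it as an IntEnum
# stores one small integer per row instead of a duplicated string.
class Protocol(IntEnum):
    TCP = 1
    UDP = 2
    ICMP = 3

rows = [("10.0.0.1", Protocol.TCP), ("10.0.0.2", Protocol.UDP)]

# Decoding back to the human-readable name is a constant-time lookup.
name = Protocol(rows[0][1]).name  # "TCP"
```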
For the IP addresses, you might even create a centrally maintained lookup table. If I understand correctly, you see roughly 10k IP addresses per day with relatively low variance, in the sense that only a few are new relative to the previous day. If you make the table immutable and append-only, and put a mem-cache on top of it, you get close to the fastest performance possible.
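A minimal sketch of that append-only lookup with an in-memory cache in front of it; the class and method names (`IpLookup`, `get_or_add`) are illustrative, not from any library:

```python
# Append-only IP lookup: rows are only ever appended, never updated,
# so the in-memory cache can never go stale.
class IpLookup:
    def __init__(self):
        self._cache = {}   # ip -> surrogate integer key (the mem-cache)
        self._rows = []    # append-only backing "table"

    def get_or_add(self, ip: str) -> int:
        key = self._cache.get(ip)
        if key is None:
            key = len(self._rows)   # next surrogate id
            self._rows.append(ip)   # immutable append
            self._cache[ip] = key
        return key

lookup = IpLookup()
a = lookup.get_or_add("203.0.113.7")
b = lookup.get_or_add("203.0.113.7")  # cache hit, same key comes back
```

Since only a few new IPs arrive per day, virtually every lookup is a dict hit, which is about as fast as it gets.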
There are only 65,536 ports in total, with a few hundred in actual use per IP, so technically all you need is to encode the port directly as an integer property. I don't think that's worth normalizing.
Then all you need is a table or graph that contains a timestamp, a key referring to the IP address in the lookup table, the actual open port, and some kind of foreign key referencing external data in case the IP alone doesn't do the trick. I would compare queries against a table that stores the IP directly vs. a table that retrieves the IP from a K/V store. It makes sense to run the K/V store on the same machine as the DB to avoid a network hop. If the difference in performance is negligible, the K/V store for IP addresses is unbelievably storage-efficient.
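To make the shape of that table concrete, here's a sketch using SQLite in memory; all table and column names (`ip_lookup`, `open_port`, `ext_id`) are assumptions for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE ip_lookup (
        ip_id INTEGER PRIMARY KEY,
        ip    TEXT UNIQUE NOT NULL
    )""")
con.execute("""
    CREATE TABLE open_port (
        ts     INTEGER NOT NULL,   -- epoch seconds
        ip_id  INTEGER NOT NULL REFERENCES ip_lookup(ip_id),
        port   INTEGER NOT NULL,   -- fits in 16 bits (0..65535)
        ext_id TEXT                -- reference into an external system
    )""")
con.execute("INSERT INTO ip_lookup (ip) VALUES ('198.51.100.4')")
ip_id = con.execute(
    "SELECT ip_id FROM ip_lookup WHERE ip = '198.51.100.4'").fetchone()[0]
con.execute("INSERT INTO open_port VALUES (1700000000, ?, 443, NULL)", (ip_id,))

# Joining back to the lookup table recovers the full observation.
row = con.execute("""
    SELECT l.ip, p.port FROM open_port p
    JOIN ip_lookup l ON l.ip_id = p.ip_id""").fetchone()
```

For the K/V comparison, you'd swap the `ip_lookup` join for a get against the local K/V store and benchmark both paths.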
I think I would put all that into a time-series DB with additional foreign keys to any external data stored in a graph, another DB, a CMS, or whatever. In many ways, it makes sense to separate time-discrete data from time-invariant data and join them ad hoc at query time, while caching all frequently accessed time-invariant data to reduce query latency.
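The ad-hoc join with a cache in front of the time-invariant side can be sketched like this; `fetch_metadata` is a hypothetical stand-in for the real graph/CMS lookup:

```python
from functools import lru_cache

# Time-invariant metadata lives elsewhere; the cache means the slow
# source is hit at most once per key, no matter how many time-series
# rows reference it.
@lru_cache(maxsize=None)
def fetch_metadata(ip_id: int) -> dict:
    # In reality this would query the graph/CMS/other DB.
    return {"ip_id": ip_id, "owner": f"org-{ip_id}"}

# Time-discrete observations: (ts, ip_id, port)
observations = [(1700000000, 7, 443), (1700000060, 7, 80)]

# Ad-hoc join at query time.
joined = [
    {"ts": ts, "port": port, **fetch_metadata(ip_id)}
    for ts, ip_id, port in observations
]
```

Both observations share `ip_id` 7, so the metadata source is queried once and the second row is served from cache.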
As long as we don't get QuanX/3D XPoint converged storage-memory in standard machines, you have to cache the crap out of everything in RAM to sustain performance. Changing that would require 3D XPoint pricing to fall below DRAM, but with DRAM prices already collapsing for quite some time, that isn't going to happen for a very long time, and Intel already sells at a loss. Conversely, at this point in time, it is actually cheaper to buy more RAM and cache more.
That said, I actually use Hasura to converge data & compute services from various sources. With it, I replace the foreign key inside a relational table with a Hasura remote join over the primary key that is already there. That works across Postgres, a graph DB, and several web services. It works, but it could be better.
A few days ago, I had a conversation with some of the Hasura engineers, and they said they are working on a universal database integration layer that should be released in late Q1.
I'm certainly keeping an eye on this because, for us, it would reduce integration time substantially, mainly because you can query across all remote data sources through the unified API exposed by Hasura.
That’s a very big deal for us.