Did you pull all 4 million rows at once? Are you doing a huge sorting 
operation over millions of records in memory? If so, then yes, you are 
going to use a lot of memory, and there aren't many ways around it.
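
If it's one-pass work, stream the rows instead of realizing them all. A 
rough sketch with clojure.java.jdbc 0.7+, where denormalize and 
index-into-es! stand in for your own functions and the connection map is 
made up:

    (require '[clojure.java.jdbc :as jdbc])

    (def db {:dbtype "mysql" :dbname "docs" :user "app" :password "..."})

    ;; reducible-query hands rows to reduce one at a time instead of
    ;; building a fully realized result set. Note: the MySQL driver only
    ;; streams when fetch-size is Integer/MIN_VALUE.
    (def rows
      (jdbc/reducible-query db
                            ["SELECT * FROM documents"]
                            {:fetch-size Integer/MIN_VALUE}))

    ;; Peak memory is one row plus whatever you accumulate, not
    ;; 4 million maps.
    (reduce (fn [n row]
              (index-into-es! (denormalize row))
              (inc n))
            0
            rows)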

Eclipse Collections (and others) have more memory-efficient collections 
than java.util; the primitive collections in particular avoid boxing. 
Maybe they make a difference in your memory footprint.
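
For instance, an IntArrayList stores unboxed ints: 4 million of them cost 
roughly 16MB in the backing array, versus several times that for an 
ArrayList of boxed Integers. A quick interop sketch (figures are 
ballpark):

    (import '(org.eclipse.collections.impl.list.mutable.primitive
              IntArrayList))

    ;; ~4 bytes per element in the backing int array, versus a ~16-byte
    ;; Integer object plus a reference per element in ArrayList<Integer>.
    (def ids (IntArrayList.))
    (dotimes [i 4000000]
      (.add ids (int i)))
    (.size ids)  ;; => 4000000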

Do as Gary said: use the database for that stuff. Why do "order by" and 
"group by" in Java, with its lousy syntax for that, when you can do it in 
the database? I'm not sure what your data set looks like, though. Are 
these "documents" large blobs of text in a simple table? If so, then you 
are out of luck.
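
If there is structure to exploit, something like this keeps the 
aggregation in MySQL and out of your heap (reusing the db spec from above; 
table and column names are invented for illustration):

    ;; MySQL does the grouping; the JVM only ever sees one row per group.
    (jdbc/query db
                [(str "SELECT parent_id, COUNT(*) AS n, "
                      "GROUP_CONCAT(title) AS titles "
                      "FROM documents "
                      "GROUP BY parent_id "
                      "ORDER BY n DESC")])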

How to scale? Isn't that the question of our time?

Onyx or Storm could probably get you there. I think Spark can too; it 
just needs more homework. Spark goes the other direction from Hadoop 
because it assumes you will have a lot of RAM. 10GB is not that much, 
really. A lot of problems go away if "your data fits in RAM". Can you get 
lots of RAM? You can get 4TB from AWS now.

https://www.awsforbusiness.com/aws-launches-biggest-instance-yet-4tb-ram/

Terabytes.
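
One caveat before renting a bigger box: the JVM only uses the heap you 
hand it, so raise -Xmx and check what you actually got, e.g.:

    ;; Max heap the JVM will use (set with -Xmx, e.g. java -Xmx8g ...):
    (/ (.maxMemory (Runtime/getRuntime)) 1024.0 1024 1024)
    ;; => heap ceiling in GB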

So all those in-memory thingies will work on a machine with 500GB or 1TB. 
Vertica! MemSQL! Spark! Apache Ignite! Hazelcast! All of these claim to 
be an "in-memory data grid." You shove data in and run queries. Maybe you 
can get rid of MySQL? Maybe you can get rid of ES? If the machine doing 
the work only has to run for a short amount of time, maybe the cost will 
make sense. But keep in mind that RAM is getting pretty cheap, and maybe 
you can scale up instead of scaling out. Scaling out gets hard pretty 
fast.
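
Scaling up still leaves you plenty of parallelism, too; one big box has a 
lot of cores. A crude sketch, assuming rows is a lazy seq of row maps and 
bulk-index! is your own batch insert into ES (pmap's fixed parallelism is 
a blunt instrument; claypoole or core.async give you real control):

    ;; Saturate the cores on one machine before reaching for a cluster.
    ;; Batching keeps per-task overhead down; pmap runs roughly
    ;; (+ n-cpus 2) batches at a time.
    (->> rows
         (partition-all 1000)
         (pmap #(mapv denormalize %))
         (run! bulk-index!))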

On Sunday, November 12, 2017 at 8:18:50 AM UTC-5, lawrence...@gmail.com 
wrote:
>
> I recently worked on a minor project that nevertheless needed to use 10 
> gigs of RAM. It ran on a reasonably powerful server, yet it taxed that 
> server. And I wondered, how are people scaling up such processes? If my 
> approach was naive, what does the less naive approach look like? 
>
> I wrote a simple app that pulled data from a MySQL database, denormalized 
> it, and then stored it in ElasticSearch. It pulled about 4 million 
> documents from MySQL. Parts of the data needed to be built up into complex 
> structures (maps, vectors) before being put into ElasticSearch. In the end, 
> the 4 million rows from MySQL became 1.5 million documents in ElasticSearch.
>
> I was wondering, what if, instead of 4 million documents, I needed to 
> process 400 million documents? I assume I would have to distribute the work 
> over several machines? I'm curious what are some of the most common routes 
> for doing so? Would this be the situation where people would start to use 
> something like Onyx or Storm or Hadoop? I looked at Spark but it seems to 
> be for a different use case, more about querying than denormalizing. 
> Likewise, dumping everything onto S3 and then using something like Athena 
> seems to be more for querying than denormalizing. 
>
> For unrelated reasons, I am moving toward an architecture where all data 
> is stored in Kafka. I suppose I could write a denormalizing app that reads 
> from Kafka, builds up the data, and then inserts it into ElasticSearch, 
> though I suppose, on the narrow issue of memory usage, using Kafka is no 
> different from using MySQL.
>
> So, I'm asking about common patterns here. When folks have an app that 
> needs more RAM than a typical server, what are the first and most common 
> steps they take? 
>
