I have three tables:
Table 1: records when and whether each user visited a gas station; it
contains all the users of interest. Call the set of all users A.

date       | user name | visited gas station?
2013-09-01 | tom       | yes
2013
Assume there is one large data set of size 100 GB on HDFS. How can I
ensure that the data sent to each mapper is around 10 GB, and that each
10 GB is randomly sampled from the 100 GB data set? Is there any Mahout
sample code that does this?
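For reference, here is a minimal sketch of drawing a random ~10% sample in
Hive, assuming the 100 GB data set is registered as a Hive table (big_table
is a made-up name). Note this only controls the sample fraction; how much
data each mapper actually receives is governed by the input split size
(e.g. mapred.max.split.size):

    -- draw roughly 1/10 of the rows at random (big_table is hypothetical)
    SELECT *
    FROM big_table TABLESAMPLE(BUCKET 1 OUT OF 10 ON rand()) s;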
Any comments will be appreciated.
Regards,
Hi all,
Here is the question. Assume we have a table like:

user_id | user_visiting_time | user_current_web_page | user_previous_web_page
user1
The table format is something like:

user_id | visiting_time | visiting_web_page
user1   | time11        | page_string_11
user1   | time12        | page_string_12 with keyword 'abc'
user1   | time13        | page_string_13
user1   | time14        | page_strin
row_number, right? I am not a Hadoop administrator; can I run the rank
function in Hive?
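(A minimal sketch, assuming Hive 0.11 or later, where ROW_NUMBER and RANK
are built-in windowing functions and need no administrator setup; user_visits
is a made-up name for the table above:)

    -- number each user's visits in time order (user_visits is hypothetical)
    SELECT user_id, visiting_time, visiting_web_page,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY visiting_time) AS rn,
           RANK()       OVER (PARTITION BY user_id ORDER BY visiting_time) AS rk
    FROM user_visits;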
Thanks again!
How can I run machine learning algorithms (any ML algorithm) directly in
Hive? Assume the input and output are already stored as Hive tables.
PS: I know Mahout is available, but I would prefer to run machine learning
algorithms directly in Hive.
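(As one illustration, assuming the model weights are already known, scoring
a logistic regression model can be written in plain HiveQL; the table
features, its columns, and the weights below are all made up, and training
itself would still need to happen elsewhere:)

    -- score a logistic regression model in pure HiveQL
    -- (features, x1, x2, and the weights 0.5, 1.2, -0.8 are hypothetical)
    SELECT id,
           1.0 / (1.0 + EXP(-(0.5 + 1.2 * x1 - 0.8 * x2))) AS predicted_prob
    FROM features;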
many thanks,