bizo developer blog

You can access the members of a map using square b...

2013-10-08T08:45:56.751-07:00

You can access the members of a map using square brackets, e.g.:

select d["timestamp"], d["id"] from sample_data_2011_12 ;

In the past, I have explicitly specified column na...

2013-10-08T04:50:09.720-07:00

In the past, I have explicitly specified column names and data types when creating a Hive table for importing access logs. If you create a table with the following statement, as mentioned in tip 2, how do you query for individual columns?

create external table sample_data_2011_12(d map)

Hi! I needed to change the bash code a bit to make...

2013-09-15T04:09:04.970-07:00

Hi! I needed to change the bash code a bit to make the first example work:

code=$(cat <<EOF
scala.io.Source.stdin.getLines map { $@ } foreach println
EOF
)
scala -e "$code"

Great tricks, really liked the post!
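A minimal sketch of why the heredoc form above works, assuming bash. The helper name `build_code` and the `_.toUpperCase` argument are hypothetical, chosen only for illustration. Because the `EOF` delimiter is unquoted, the shell expands `$@` inside the heredoc body, so the Scala snippet passed as an argument is spliced into the one-liner before it is handed to `scala -e`. The sketch prints the generated code instead of running it, so it does not need Scala installed.

```shell
#!/usr/bin/env bash

# Hypothetical helper: builds the Scala one-liner from the function's arguments.
build_code() {
  # The delimiter EOF is unquoted, so $@ is expanded inside the heredoc body.
  cat <<EOF
scala.io.Source.stdin.getLines map { $@ } foreach println
EOF
}

# Capture the generated program in a variable, as in the comment above.
code=$(build_code '_.toUpperCase')

# Print the program instead of executing it; with Scala installed,
# the last step would be: scala -e "$code"
printf '%s\n' "$code"
```

Quoting the variable in `scala -e "$code"` keeps the whole program as a single argument, spaces and braces included.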

I needed to change the first example to: code=$(c...

2013-09-15T04:08:01.001-07:00

I needed to change the first example to:

code=$(cat <<EOF
scala.io.Source.stdin.getLines map { $@ } foreach println
EOF
)
scala -e "$code"

to make it work. Otherwise, great blog post and great tricks :)

Grega

Great explanation. Simple and to the point. Thanks...

2013-07-18T10:28:59.760-07:00

Great explanation. Simple and to the point. Thanks for writing about the map-side aggregation technique.

Hi, I need your help with this. Tell me how to set ...

2013-07-01T23:20:06.117-07:00

Hi, I need your help with this. Tell me how to set the access token in the header.

I am unable to do that. I have set the network code and application name, and I am passing the company ID to get the company. But how do I set the access token?

Thanks

Exactly what I was looking for. My UDTF is already...

2013-03-26T07:05:02.008-07:00

Exactly what I was looking for. My UDTF is already working. Thanks from Slovakia!

Hi pas, Sorry for the late reply, but currently w...

2013-02-11T10:01:12.492-08:00

Hi pas,

Sorry for the late reply, but currently we're not using the EC2 scripts; instead, we're running our Spark cluster in an Amazon EMR cluster using bootstrap actions.

We're planning on publishing these at some point, either as a blog post or a GitHub repo, but haven't yet.

They are tied to our particular setup, but would offer others a starting place if they're interested in the same approach.

Wonderful! I was going mad looking for the source ...

2013-01-30T10:29:35.856-08:00

Wonderful! I was going mad looking for the source of intermittent failures in Chef's mdadm provider on EC2. You've saved what remains of my hair.

I would love to know what (if any) changes you hav...

2013-01-23T22:43:00.279-08:00

I would love to know what (if any) changes you have made to the spark_ec2 script. As the script is now, it is far from production ready.

Thanks for the udevadm tip. I also could understan...

2012-12-27T03:52:31.266-08:00

Thanks for the udevadm tip. I also couldn't understand why mdadm would report devices as busy, even though nothing was using them (no dmraid, no device mapper).

Thank you so much! This has been a huge issue for ...

2012-09-14T13:09:05.060-07:00

Thank you so much! This has been a huge issue for us. Now it's solved.

Wow, seems like you saved my life here. Thanks a l...

2012-08-10T07:17:50.602-07:00

Wow, seems like you saved my life here. Thanks a lot!

The main reason would be that structs are a better...

2012-07-31T15:11:01.708-07:00

The main reason would be that structs are a better description of what you're actually returning. In the example, we can access the data inside of the result using "firstName" and "lastName". It might seem pretty intuitive to the developer to simply make these two fields of a string array, but what happens if the return type has a large number of fields or if there is no natural order to the fields?

The other reason is that arrays are homogeneous in Hive, so you can't return multiple types of data in a single array. With a struct, for example, you could return a firstName, lastName, age (integer-valued), aliases (array of strings), and known addresses (array of structs).

Why not return an array?

2012-07-31T14:51:14.463-07:00

Why not return an array?

I was having this same issue! Thank you!

2012-07-26T19:30:52.181-07:00

I was having this same issue! Thank you!

Very useful. I was going mad. Thanks a lot.

2012-07-22T12:18:42.571-07:00

Very useful. I was going mad. Thanks a lot.

2012-07-22T12:14:53.302-07:00

This comment has been removed by the author.

Thank you so much! It took me a while to find thi...

2012-07-20T12:26:30.999-07:00

Thank you so much! It took me a while to find this solution via Google. I'm surprised it's not more prevalent around the web.

Thanks a lot for the post! I had been pulling my ha...

2012-07-09T23:14:27.249-07:00

Thanks a lot for the post! I had been pulling my hair out trying to solve this issue while setting up a NAS.

I'd recommend having the query that loads data...

2012-02-15T08:08:17.223-08:00

I'd recommend having the query that loads data expand the object into its individual fields, as in the "output" table in the post. That way, the only time users need to be aware of the object structure is if they're using the UDF directly.

For that case, I'd recommend documenting the object format in a @Description annotation (http://hive.apache.org/docs/r0.7.0/api/org/apache/hadoop/hive/ql/exec/Description.html) on your UDF. The "value" and "extended" fields will then be available inside the Hive console via "describe function foo" and "describe extended function foo".

The problem with this though is that the hive-user...

2012-02-14T13:52:04.489-08:00

The problem with this, though, is that the Hive users need to know the object's structure beforehand to properly take advantage of the system. It's not so easy for them to browse each table and find what they want. Do you have any tools that allow people to do that?

Yes, some more information on the scroll-back buff...

2012-01-13T11:12:28.670-08:00

Yes, some more information on the scroll-back buffer here:
http://www.samsarin.com/blog/2007/03/11/gnu-screen-working-with-the-scrollback-buffer/

Can you retrieve the output that happens in hive w...

2012-01-13T11:08:56.552-08:00

Can you retrieve the output that Hive produces while the screen is detached? Does it show it all when you reconnect?

Thanks Larry, really helpful. I do have 3 questio...

2012-01-04T09:27:09.266-08:00

Thanks Larry,

really helpful. I do have 3 questions and would much appreciate your feedback.

1. Why do we need a map side script? If we are just selecting 2 columns from a table and then performing some calculation in the reduce side with a script, can we not just "select" those 2 columns and then use the construct in reduce like:
select transform(kv_my_input.col1,kv_my_input.col2) using 'reduce_script' as newcol1,newcol2
from kv_my_input cluster by col1;
2. Is the map using 'map script' and reduce using 'reduce script' construct equivalent to transform ( list of columns )? If so, how do we know that the transform in question 1 will run your script as "reduce" logic and not as "map" logic? Or does it simply perform whatever is inside the script?
3. Is clustering mandatory? Is it clustering that actually enables one to run the reduce-side script as you have shown, i.e. without clustering, would the rows be scattered?

thanks again

ameet
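On question 3, a minimal sketch of what a reduce-side script can rely on, assuming Hive's usual streaming behaviour: transform passes rows to the script as tab-separated lines on stdin, and cluster by sends all rows with the same key to the same reducer, sorted by that key. The function name `reduce_script` and the sample rows are hypothetical. The script sums the second column per key and only works because equal keys arrive next to each other; without clustering, the same key could show up in several places and produce several partial sums.

```shell
#!/usr/bin/env bash

# Hypothetical reduce-side script: sums column 2 per value of column 1.
# Assumes rows with the same key are contiguous on stdin.
reduce_script() {
  awk -F'\t' -v OFS='\t' '
    # Key changed: emit the finished group and reset the running sum.
    NR > 1 && $1 != key { print key, sum; sum = 0 }
    { key = $1; sum += $2 }
    # Emit the last group, if there was any input at all.
    END { if (NR > 0) print key, sum }
  '
}

# Simulated reducer input: tab-separated rows already grouped by column 1.
result=$(printf 'a\t1\na\t2\nb\t5\n' | reduce_script)
printf '%s\n' "$result"
```

With the sample input this emits one row per key (a with 3, b with 5), tab-separated, which is the shape Hive expects back from a transform script.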