Hive n-gram stop-word list?

Although it is listed as one of the example use cases, I haven't found an example of filtering out junk words (and, or, etc.) from a Hive n-gram result.

SELECT explode(context_ngrams(sentences(lower(description)), array("criminal", null), 10)) AS x FROM mapped_discussions;

{"ngram":["justice"],"estfrequency":274.0}
{"ngram":["behavior"],"estfrequency":121.0}
{"ngram":["law"],"estfrequency":92.0}
{"ngram":["activity"],"estfrequency":69.0}
{"ngram":["acts"],"estfrequency":41.0}
{"ngram":["procedure"],"estfrequency":35.0}
{"ngram":["and"],"estfrequency":29.0}
{"ngram":["or"],"estfrequency":27.0}
{"ngram":["case"],"estfrequency":26.0}
{"ngram":["cases"],"estfrequency":25.0}

Any ideas? Thanks!

Answers


There is an excellent post on this topic here: http://bigdatabloggin.blogspot.com/2012/08/trending-topics-in-hive-ngrams.html
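
As a rough sketch of one possible approach (not necessarily the one the linked post takes), the exploded n-gram structs can be wrapped in a subquery and filtered against a stop-word list with array_contains. The stop-word list below is a hypothetical placeholder, and the candidate count of 100 is an assumption: stop words take up some of the top-k slots before the filter runs, so it helps to request more candidates than you finally need.

```sql
-- Hypothetical stop-word list; extend it to suit your data.
-- Ask for 100 candidates rather than 10, because stop words occupy
-- some of the top-k slots before the outer filter removes them.
SELECT t.x.ngram[0]     AS word,
       t.x.estfrequency AS estfrequency
FROM (
  SELECT explode(context_ngrams(sentences(lower(description)),
                                array("criminal", null), 100)) AS x
  FROM mapped_discussions
) t
WHERE NOT array_contains(array("and", "or", "the", "of", "a", "in"), t.x.ngram[0])
ORDER BY estfrequency DESC
LIMIT 10;
```

For a longer list, keeping the stop words in their own table and excluding matches with a LEFT OUTER JOIN ... WHERE ... IS NULL would probably be easier to maintain than an inline array.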

