How to Calculate Word Frequency Using Java 8

In this article, I will show how to calculate the word frequency
of a given list of strings using Java 8. This is helpful when analyzing
big data: for example, we can find the terms that users
search for most often.

Tools used:

1) Eclipse Luna 4.4.1

2) Maven 3.3.3

3) JDK 1.8

Simple steps to follow:

1) Create a simple Maven project.

2) Write a simple Java program to calculate word frequency in a list of strings.

3) Run the program.

Write a simple Java program to calculate word frequency in a list of strings:

WordFrequency.java is the main class. Its wordFrequency() method
calculates the frequency of each word in the given list of strings.

WordFrequency.java

package com.devjavasource.java8;

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WordFrequency {
	public static void main(String[] args) {
		WordFrequency obj = new WordFrequency();

		final Map<String, Double> map = obj.wordFrequency(Arrays.asList(
				"Hotels in Oakland", 
				"Resorts near Sanfrancisco",
				"Restorants near Bay area", 
				"Software Jobs in Oakland",
				"Hotels in Oakland Airport",
				"Resorts near Abudabi Airport",
				"Restorants near Australia",
				"Software Jobs in USA"));
		System.out.println("JAVA 8 : WordFrequency Example");
		System.out.println("===============================");
		map.entrySet().forEach(printit);
	}

	/**
	 * Prints a single entry: the word and its score to two decimal places.
	 */
	private static Consumer<Map.Entry<String, Double>> printit = w -> System.out
			.printf("word: %s score:%.2f%n", w.getKey(), w.getValue());

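	/**
	 * Builds a map of all words found in the list of strings, along with
	 * their relative frequency (as a percentage of all counted words).
	 * Punctuation is stripped and words are uppercased; tokens that contain
	 * digits or are shorter than three letters are ignored.
	 * 
	 * @param strings the strings to analyze
	 * @return words mapped to their percentage share, lowest first
	 */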
	public Map<String, Double> wordFrequency(List<String> strings) {
		Stream<String> streams = strings.stream();

		// map each word to its total number of occurrences
		Map<String, Long> wordCount = streams
				// split each string on spaces: Stream<String[]>
				.map(w -> w.split(" "))
				// flatten to Stream<String>
				.flatMap(Arrays::stream)
				// strip non-alphanumerics and uppercase all
				.map(trimit)
				// keep only purely alphabetic words longer than two letters
				.filter(isalpha)
				// group identical words and count each group
				.collect(
						Collectors.groupingBy(Function.identity(),
								Collectors.counting()));

		// total number of words in list
		Long wordTotal = wordCount.values().stream()
				.reduce(0L, (a, b) -> a + b);

		// convert total occurrences to a percentage of total words
		Map<String, Double> wordFreq = wordCount
				.entrySet()
				.stream()
				.collect(
						Collectors.toMap(e -> e.getKey(),
								e -> (100 * (e.getValue().doubleValue()))
										/ wordTotal));

		// sort the entries by ascending frequency
		List<Map.Entry<String, Double>> sorted = wordFreq.entrySet().stream()
				.sorted(Map.Entry.comparingByValue())
				.collect(Collectors.toList());

		// a LinkedHashMap preserves the sorted insertion order
		Map<String, Double> sortedMap = new LinkedHashMap<>();
		sorted.forEach(e -> sortedMap.put(e.getKey(), e.getValue()));
		return sortedMap;
	}

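	// removes every non-alphanumeric character and uppercases the result
	private Function<String, String> trimit = s -> s.replaceAll("[^A-Za-z0-9]",
			"").toUpperCase();

	// accepts only purely alphabetic strings of at least three characters
	private Predicate<String> isalpha = s -> s.matches("[a-zA-Z]+")
			&& s.length() > 2;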
}
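
To see the counting step in isolation, here is a minimal sketch that applies the same split, normalize, filter, and group pipeline to two short strings. The class name WordCountSketch and its methods countWords() and percentage() are hypothetical helpers for illustration only, not part of the project above. As further assumptions, the sketch splits on any whitespace, uppercases with Locale.ROOT, and collects into a TreeMap so the words print alphabetically.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper class for illustration; not part of the original project.
class WordCountSketch {

	// Count how often each normalized word occurs in the given strings.
	static Map<String, Long> countWords(List<String> strings) {
		return strings.stream()
				// split every string into tokens and flatten to one stream
				.flatMap(s -> Arrays.stream(s.split("\\s+")))
				// strip non-alphanumerics and uppercase (assumption: Locale.ROOT)
				.map(w -> w.replaceAll("[^A-Za-z0-9]", "").toUpperCase(Locale.ROOT))
				// keep only purely alphabetic words longer than two letters
				.filter(w -> w.matches("[A-Z]+") && w.length() > 2)
				// group identical words and count them, sorted by word
				.collect(Collectors.groupingBy(Function.identity(),
						TreeMap::new, Collectors.counting()));
	}

	// Convert one count into a percentage of all counted words.
	static double percentage(long count, long total) {
		return 100.0 * count / total;
	}

	public static void main(String[] args) {
		Map<String, Long> counts = countWords(
				Arrays.asList("Hotels in Oakland", "Jobs in Oakland"));
		long total = counts.values().stream().mapToLong(Long::longValue).sum();
		counts.forEach((word, count) -> System.out.printf(
				"word: %s count:%d score:%.2f%n", word, count,
				percentage(count, total)));
	}
}
```

With this input, "in" is dropped by the length filter, so four words are counted: OAKLAND appears twice (50 percent), while HOTELS and JOBS appear once each (25 percent).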

Run the program:

Select WordFrequency, then Run As -> Java Application.

Output:

JAVA 8 : WordFrequency Example
===============================
word: SANFRANCISCO score:4.00
word: AUSTRALIA score:4.00
word: USA score:4.00
word: AREA score:4.00
word: BAY score:4.00
word: ABUDABI score:4.00
word: HOTELS score:8.00
word: AIRPORT score:8.00
word: SOFTWARE score:8.00
word: JOBS score:8.00
word: RESTORANTS score:8.00
word: RESORTS score:8.00
word: OAKLAND score:12.00
word: NEAR score:16.00
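
The output above lists the least frequent words first. When the goal is to find the most used search terms, it may be more convenient to have the highest scores first and keep only the top few. The following is a minimal sketch of one way to do that; the class TopWordsSketch and its topN() method are hypothetical names introduced here, and the scores in main() are a small subset copied from the output above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper class for illustration; not part of the original project.
class TopWordsSketch {

	// Return the n highest-scoring entries, highest score first.
	static Map<String, Double> topN(Map<String, Double> scores, int n) {
		return scores.entrySet().stream()
				// reverse the natural value order so the largest score comes first
				.sorted(Map.Entry.<String, Double>comparingByValue().reversed())
				.limit(n)
				.collect(Collectors.toMap(
						Map.Entry::getKey,
						Map.Entry::getValue,
						// keys are already unique, so this merge function is never called
						(first, second) -> first,
						// a LinkedHashMap keeps the sorted order
						LinkedHashMap::new));
	}

	public static void main(String[] args) {
		Map<String, Double> scores = new LinkedHashMap<>();
		scores.put("USA", 4.0);
		scores.put("HOTELS", 8.0);
		scores.put("OAKLAND", 12.0);
		scores.put("NEAR", 16.0);

		topN(scores, 2).forEach((word, score) -> System.out.printf(
				"word: %s score:%.2f%n", word, score));
	}
}
```

For these four scores, topN(scores, 2) keeps NEAR and OAKLAND, in that order. Collecting with the four-argument Collectors.toMap avoids the separate list and loop, at the cost of a merge function that is required by the signature but unused here.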

You can download the complete project here.

*** Venkat – Happy learning ***